Get the default webpage file name - ruby

Is there a way to get the default web page of a given URL in Ruby?
I'm looking for a function like
get_indexpage_for("www.example.com")
with a result that's equal to something like
'index.html' or 'index.php' or 'index.htm' or ...
The HTTP headers don't contain this information. I've also looked at the Net::HTTP class, but I couldn't find a solution.
Can someone please help?

This will do it, provided the server actually redirects to a URL that includes the file name. It works like a charm on some pages and not on others.
It should work on the URL I've used in my example:
require 'mechanize'
require 'pp'
agent = Mechanize.new
login_url = 'http://www.reports.rtui.com'
page = agent.get(login_url)
puts page.uri
index.html is the standard default, but if you go to google.com they don't appear to have an index page. Instead, it runs more like an application, serving content as it's requested.
I'm no pro by any measure, but based on my research there doesn't seem to be one magic bullet that does what you want. At least, not one that's obvious. It really depends on the page itself.
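For what it's worth, the closest thing to get_indexpage_for that I can think of is to take the final URI after any redirects (for example page.uri from the Mechanize snippet above) and check whether its path ends in a file name. This is only a minimal sketch: it assumes the server redirects to its index file, which many servers never do, and the method name index_page_from is made up for illustration.

```ruby
require 'uri'

# Hypothetical helper (not part of Mechanize or Net::HTTP):
# returns the last path segment of a final, post-redirect URI if it looks
# like a file name, or nil when the path does not name a file.
def index_page_from(uri)
  path = URI(uri.to_s).path
  return nil if path.nil? || path.empty? || path.end_with?('/')

  name = File.basename(path)
  # Treat only segments with an extension (index.html, index.php, ...) as files.
  name.include?('.') ? name : nil
end

index_page_from('http://www.example.com/index.php') # "index.php"
index_page_from('http://www.example.com/')          # nil, nothing to discern
```

If this returns nil, the server simply never revealed which file (if any) backs the URL, and there is nothing further to extract on the client side.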

Related

How to get the current URL for an HTML page

I am scraping a website using Nokogiri. This particular website deals with absolute URLs differently.
If I give it a URL like:
page = Nokogiri::HTML(open(link, :allow_redirections => :all))
it will redirect to the HTTPS version, and also redirect to the long version of the URL. For example, a link like
http://www.website.com/name
turns into
http://www.website.com/other-area/name
This is fine and doesn't really affect my scraper. However, there are certain edge cases that I can avoid if I can tell my scraper what the current URL is.
After I pass in the above link to my page variable, how can I get the current URL of that page after the redirect happens?
I'm assuming you're using the open_uri_redirections gem because :allow_redirections is not necessary in Ruby 2.4+.
Save the result of OpenURI's open (URI.open on Ruby 2.5+; Kernel#open no longer accepts URLs as of Ruby 3.0):
require 'open-uri'
require 'nokogiri'
r = URI.open('http://www.google.com/gmail')
r.base_uri
# #<URI::HTTPS https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1#>
page = Nokogiri::HTML(r)
Alternatively, use Mechanize. Then you can do:
agent = Mechanize.new
page = agent.get url
puts page.uri # this will be the redirected url

Ruby - How can I follow a .php link through a request and get the redirect link?

Firstly I want to make clear that I am not familiar with Ruby, at all.
I'm building a Discord Bot in Go as an exercise, the bot fetches UrbanDictionary definitions and sends them to whoever asked in Discord.
However, UD doesn't have an official API, so I'm using this. It's a Heroku app written in Ruby. From what I understand, it scrapes the UD page for the given search.
I want to add a random-definition feature to my bot, but the API doesn't support it, so I'd like to add it myself.
As I see it, this shouldn't be hard, since http://www.urbandictionary.com/random.php only redirects you to a normal link on the site. If I can follow the link to the "normal" one, I can get that link and pass it to the existing scraper, which can then handle it just like any other link.
I have no idea how to follow it, and I was hoping I could get some pointers, samples, or anything else that helps.
Here's the "ruby" way using net/http and uri
require 'net/http'
require 'uri'
uri = URI('http://www.urbandictionary.com/random.php')
response = Net::HTTP.get_response(uri)
response['Location']
# => "http://www.urbandictionary.com/define.php?term=water+bong"
Urban Dictionary is using an HTTP redirect (302 status code, in this case), so the "new" URL is passed back in an HTTP header (Location). To get a better idea of what the above is doing, here's a way to do the same thing using just curl and a system call:
`curl -I 'http://www.urbandictionary.com/random.php'`. # Get the headers using curl -I
split("\r\n"). # Split on line breaks
find{|header| header =~ /^Location/i}. # Get the 'Location' header (case-insensitive, since HTTP/2 responses use lowercase names)
split(' '). # Split on spaces
last # Get the last element in the split array
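If the target itself redirects again, one request is not enough. The following is a rough sketch of following the Location header in a loop until a non-redirect response comes back. The method name final_uri, the limit of 5 hops, and the injectable fetch parameter (there only so the loop can be exercised without a network) are my own assumptions, not anything Net::HTTP provides.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: follows redirects and returns the final URI.
# `fetch` must return an object that responds to #code and #[] the way
# Net::HTTPResponse does; by default it performs a real GET request.
def final_uri(url, limit: 5, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI(url)
  limit.times do
    response = fetch.call(uri)
    location = response['location']
    # Any non-3xx answer (or a 3xx without a Location header) ends the chain.
    return uri unless response.code.to_s.start_with?('3') && location

    # Location may be relative, so resolve it against the current URI.
    uri = uri.merge(location)
  end
  raise "gave up after #{limit} redirects"
end

# Usage (performs real HTTP requests, so it is left commented out):
# puts final_uri('http://www.urbandictionary.com/random.php')
```

The hop limit is there because a misconfigured server can redirect in a circle, and without it the loop would never finish.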

How can I convert a relative link in Mechanize to an absolute one?

Is there a way to convert a Mechanize relative-link object to another one that contains the absolute URL?
Mechanize must know the absolute link, because I can call the click method on relative links too.
You can just merge the page uri (which is always absolute) with the link uri:
page.uri.merge link.uri
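To illustrate what merge does, here is a small sketch using only the standard uri library, with plain strings standing in for page.uri and link.uri. The helper name absolutize and the example.com addresses are made up for the example.

```ruby
require 'uri'

# Hypothetical helper: resolves an href against the absolute URI of the
# page it was found on, the same way page.uri.merge(link.uri) would.
def absolutize(page_uri, href)
  URI(page_uri.to_s).merge(href.to_s)
end

base = 'http://www.example.com/dir/page.html'

puts absolutize(base, 'other.html')             # http://www.example.com/dir/other.html
puts absolutize(base, '/root.html')             # http://www.example.com/root.html
puts absolutize(base, '../up.html')             # http://www.example.com/up.html
puts absolutize(base, 'http://other.example/x') # http://other.example/x
```

An href that is already absolute passes through unchanged, so the same call can be applied to every link on a page without checking first.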
This is not specific to Mechanize, but an easy way would be to use the base URL in the <base> tag and add it to the relative URL to use for whatever purpose you want. This generally works.
But, then I'm not sure if you could call the click method on that since I don't know Mechanize that well.
You can also use resolve
Example:
require 'mechanize'
agent = Mechanize.new
page = agent.get(url) # url: the absolute address of the page you start from
some_rel_url = '/something'
abs_url = agent.resolve(some_rel_url) # resolved against the current page
Keep in mind that the other answers provided do not take into account all the possibilities to get the base url as described here
Basically this:

Redirect from current page to a new page

I am having trouble with some Ruby CGI.
I have a home page (index.cgi) which is a mix of HTML and Ruby, and has a login form in it.
On clicking on the Submit button the POST's action is the same page (index.cgi), at which point I check to make sure the user has entered data into the correct fields.
I have a counter which increases by 1 each time a field is left empty. If this counter is 0 I want to change the current loaded page to something like contents.html.
With this I have:
if ( errorCount > 0 )
do nothing
else
....
end
What do I need to put where I have the ....?
Unfortunately I cannot use any frameworks, as this is for university coursework, so I have to use plain Ruby.
I have tried using the CGI#header method as suggested, but it is not working for me.
As mentioned, my page is index.cgi. It is made of a mixture of Ruby and HTML using "here doc" statements.
At the top of my code page I have my shebang line, followed by an HTML header statement.
I then do the CGI form validation part, and within this I have tried doing something like:
print this.cgi( { 'Status' => '302 Moved', 'location' => '{http://localhost:10000/contents.html' } )
All that happens is that this line is printed at the top of the browser window, above my index.cgi page.
I hope this makes sense.
To redirect the browser to another URL, you must output a 30x HTTP response that contains a Location header, such as Location: /foo/bar. You can do that using the CGI#header method (named CGI#http_header since Ruby 2.0).
Rather than dealing with these low-level details yourself, I suggest using a simple framework such as Sinatra or, at least, writing your script as a Rack-compatible application.
If you really need to use the bare CGI class, have a look at this simple example: https://github.com/tdtds/amazon-auth-proxy/blob/master/amazon-auth-proxy.cgi.
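Without any library at all, a CGI redirect is just a Status line and a Location line, followed by a blank line, written to standard output before anything else. The attempt in the question shows up as text in the browser most likely because the HTML header had already been printed by then, so the redirect lines were treated as page content. Here is a minimal sketch: the helper name redirect_headers and the error_count variable are stand-ins I made up, and the URL is the one from the question.

```ruby
#!/usr/bin/env ruby

# Hypothetical helper: builds the header block of a CGI redirect response.
# The empty line at the end (\r\n\r\n) terminates the headers.
def redirect_headers(location, status: '302 Found')
  "Status: #{status}\r\nLocation: #{location}\r\n\r\n"
end

error_count = 0 # stand-in for the form-validation counter from the question

if error_count.zero?
  # Nothing may be printed before this, not even a Content-Type header.
  print redirect_headers('http://localhost:10000/contents.html')
else
  # Only in this branch do we send the normal page header and body.
  print "Content-Type: text/html\r\n\r\n"
  # ... the here-doc HTML for index.cgi would follow here
end
```

The important design point is that the validation runs first and the decision between "redirect" and "render the page" is made before a single byte of output is sent.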

Using and hiding default class

This is my first time getting my hands dirty with CI so I'm getting a little confused.
I want to accomplish a couple of things with my question. First of all, I'd like to always use the default controller without having it appear in the URL. For example, I created a new class named after my site (Example.php) and that works fine. However, if I want to call the search function in my controller, I then have to go to example.com/index.php/example/search/.
The second thing I want to accomplish is that when I run a search I get a nice-looking URL like so: example.com/search/This+is+a+search (I haven't gotten to removing the index.php portion, but I know to use an .htaccess file for that). I'm not worried about the actual mechanics of the search, just that I'd like to format the URL in this way.
I originally experimented with using a Search class, but found that it doesn't allow me to put the search in the URL, because the second segment is expected to be a function name and not the extra stuff.
Thanks for any help.
In the application/config/routes.php file, add $route entries that route everything to your controller.
Something like this:
$route['([^\/]+)'] = 'content/index/$1';
$route['([^\/]+)\/([^\/]+)'] = 'content/index/$1/$2';
This will redirect urls like example.com/A and example.com/A/B to a controller named content. Parameters A and B will be passed to method index.

Resources