Ruby Mechanize Page to String

I have captured a Mechanize page. How can I get that page into a string? I can output the object with Pretty Print, but I'd like to have it in a string for further processing, and I can't seem to find a method for it.
Any advice appreciated.
Cheers

I've never needed to save the page content to a string, but this works:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.google.com")
s = page.content

Related

Find a url in a document using regex in ruby

I have been trying to find a URL in an HTML document. This has to be done with a regex, since the URL is not inside any HTML tag, so I can't use Nokogiri for it. To get the HTML I used HTTParty, like this:
require 'httparty'
doc = HTTParty.get("http://127.0.0.1:4040")
puts doc
That outputs the HTML. To get to the URL I used the .split() method. The full code is:
require 'httparty'
doc = HTTParty.get('http://127.0.0.1:4040').split(".ngrok.io")[0].split('https:')[2]
puts "https:#{doc}.ngrok.io"
I want to do this with a regex, since ngrok might update their localhost HTML file and then this code would stop working. How do I do it?
If I understood correctly, you want to find all hostnames matching "https://(any subdomain).ngrok.io", right?
If so, you can use String#scan with a regexp. Here is an example:
# get your body (replace with your HTTP request)
body = "my doc contains https://subdomain.ngrok.io and https://subdomain-1.subdomain.ngrok.io"
puts body
# Use scan and you're done
urls = body.scan(%r{https://[0-9A-Za-z.-]+\.ngrok\.io})
puts urls
The result is an array containing ["https://subdomain.ngrok.io", "https://subdomain-1.subdomain.ngrok.io"].
Call .uniq on it if you want to get rid of duplicates.
This doesn't handle every edge case, but it's probably enough for what you need.
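If you'd rather not hand-roll the regex, Ruby's standard library can pull URLs out of free text with URI.extract (deprecated in newer Rubies, but still functional), after which you can filter on the host. A sketch using the same placeholder body:

```ruby
require 'uri'

body = "my doc contains https://subdomain.ngrok.io and https://subdomain-1.subdomain.ngrok.io"

# Extract every https URL, then keep only those on an ngrok.io host
urls = URI.extract(body, %w[https]).select do |u|
  URI(u).host.end_with?(".ngrok.io")
end
puts urls
```

This avoids having to get the character class right yourself, at the cost of a second pass to filter hosts.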

How to get the current URL for a HTML page

I am scraping a website using Nokogiri. This particular website deals with absolute URLs differently.
If I give it a URL like:
page = Nokogiri::HTML(open(link, :allow_redirections => :all))
it will redirect to the HTTPS version, and also redirect to the long version of the URL. For example, a link like
http://www.website.com/name
turns into
http://www.website.com/other-area/name
This is fine and doesn't really affect my scraper; however, there are certain edge cases where, if I can tell my scraper what the current URL is, I can avoid them.
After I pass in the above link to my page variable, how can I get the current URL of that page after the redirect happens?
I'm assuming you're using the open_uri_redirections gem, because :allow_redirections is not necessary in Ruby 2.4+.
Save the result of OpenURI's open:
require 'open-uri'
r = URI.open('http://www.google.com/gmail')  # plain open() no longer accepts URLs in Ruby 3+
r.base_uri
# #<URI::HTTPS https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1#>
page = Nokogiri::HTML(r)
Use Mechanize, then you can do:
agent = Mechanize.new
page = agent.get url
puts page.uri # this will be the redirected url
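Under the hood, redirect tracking is just URI resolution: each Location header is resolved against the URI of the request that produced it. You can see the mechanics with the standard library alone (the URLs are the question's examples):

```ruby
require 'uri'

current  = URI("http://www.website.com/name")
location = "/other-area/name"        # value of a redirect's Location header
current  = current.merge(location)   # the "current URL" after the redirect
puts current
```

This is effectively what `base_uri` and `page.uri` report after the client has followed the chain.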

Get the default webpage file name

Is there a way to get the default web page of a given URL in Ruby?
I'm looking for a function like
get_indexpage_for("www.example.com")
with a result equal to something like
'index.html' or 'index.php' or 'index.htm' or ...
Even the HTTP headers don't contain this information, and I've also looked at the Net::HTTP class, but I couldn't find a solution.
Can someone please help?
This will do it if there actually is a URL that can be discerned. It works like a charm on some pages and not on others.
It should work on the URL I've used in my example:
require 'mechanize'
agent = Mechanize.new
login_url = 'http://www.reports.rtui.com'
page = agent.get(login_url)
puts page.uri
index.html is the standard default, but if you go to google.com they don't appear to have an index page. Instead it runs more like an application, serving content as it's requested.
I'm no pro by any measure, but based on my research there doesn't seem to be one magic bullet that does what you want, at least not an obvious one. It really depends on the page itself.
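If you do need a guess, one approach is trial and error: request a handful of common default-document names and see which one returns the same body as the bare root path. A hedged sketch (the candidate list is a guess, the helper name is mine, and the actual lookup is network-dependent):

```ruby
require 'net/http'
require 'uri'

# Common default-document names; purely a guess, extend as needed
CANDIDATES = %w[index.html index.htm index.php default.asp default.aspx]

# Returns the first candidate whose body matches the root document,
# or nil if none do. base_url is a placeholder, e.g. "http://www.example.com"
def probable_index(base_url, candidates = CANDIDATES)
  root = Net::HTTP.get(URI("#{base_url}/"))
  candidates.find do |name|
    Net::HTTP.get(URI("#{base_url}/#{name}")) == root
  rescue StandardError
    false
  end
end
```

Servers that generate pages dynamically (like the google.com case above) will simply return nil, which matches the "no magic bullet" caveat.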

Submitting login fields during a scraping process with ruby?

I need to scrape some financial data from a system called NetTeller.
An example can be found here.
Note the initial ID field prompt; then, once you submit, you have to enter your password.
As you can see, it has a two-step process: first you enter an ID number, and after submission you are presented with a password field. I'm hitting some bumps when it comes to jumping through these two hoops before getting into the system and to the data I actually want. How would one handle a scenario like this, where you need to pass through the authentication fields before getting to the data you want to scrape?
I assumed I could just jump in with HTTPClient and Nokogiri, but I'm curious whether there are any tricks when dealing with a two-page login like this.
I would use Mechanize. The first page is "tricky" because the login form is inside an iframe, so you can request the iframe's source URL directly. Here is how:
agent = Mechanize.new
# Get first page (requesting the iframe's source URL directly)
iframe_url = 'https://www.banksafe.com/sfonline/'
page = agent.get(iframe_url)
login_form = page.forms.first
# The field name "id" and the values are guesses; inspect page.forms
# on the real page to find the actual names
login_form.field_with(:name => "id").value = "12345678"
# Get second page
response = login_form.submit
second_login_form = response.forms.first
second_login_form.field_with(:name => "password").value = "xxxxx"
# Get page to scrape
response = second_login_form.submit
This is how you could handle a scenario like this. Obviously you might need to adapt to exactly how those forms and fields are written, and to other page-specific details, but I would go with this approach.

How can I convert a relative link in Mechanize to an absolute one?

Is there a way to convert a relative-link Mechanize object into one that contains the absolute URL?
Mechanize must know the absolute link, because I can call the click method on relative links too.
You can just merge the page uri (which is always absolute) with the link uri:
page.uri.merge link.uri
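The same merge semantics are available on plain URI objects, so you can check the behavior without Mechanize (the URLs are illustrative):

```ruby
require 'uri'

page_uri = URI("https://example.com/a/b.html")  # absolute page URI
link_uri = URI("../c.html")                     # relative link URI
puts page_uri.merge(link_uri)
```

`merge` resolves the relative reference against the page's path, including `..` segments, per RFC 3986.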
This is not specific to Mechanize, but an easy way is to take the base URL from the <base> tag and prepend it to the relative URL. This generally works.
That said, I'm not sure whether you could call the click method on the result, since I don't know Mechanize that well.
You can also use resolve.
Example:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://example.com/')  # any URL; this sets the agent's current page
some_rel_url = '/something'
url = agent.resolve(some_rel_url)        # resolved against page.uri
Keep in mind that the other answers provided do not take into account all the ways the base URL can be determined; a <base> element in the document's head, for example, changes how relative URLs resolve.
