I am scraping a website using Nokogiri. This particular website deals with absolute URLs differently.
If I give it a URL like:
page = Nokogiri::HTML(open(link, :allow_redirections => :all))
it will redirect to the HTTPS version, and also redirect to the long version of the URL. For example, a link like
http://www.website.com/name
turns into
http://www.website.com/other-area/name
This is fine and doesn't really affect my scraper; however, there are certain edge cases I could avoid if I could tell my scraper what the current URL is.
After I pass in the above link to my page variable, how can I get the current URL of that page after the redirect happens?
I'm assuming you're using the open_uri_redirections gem, since :allow_redirections isn't necessary on Ruby 2.4+, where OpenURI follows HTTP-to-HTTPS redirects on its own.
Save the result of OpenURI's open:
require 'open-uri'

r = open('http://www.google.com/gmail')
r.base_uri
# => #<URI::HTTPS https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1#>

page = Nokogiri::HTML(r)
Use Mechanize, then you can do:
agent = Mechanize.new
page = agent.get url
puts page.uri # this will be the redirected url
Is there a way to get the default web page of a given URL in Ruby?
I'm looking for a function like
get_indexpage_for("www.example.com")
with a result that's equal to something like
'index.html' or 'index.php' or 'index.htm' or ...
Even the HTTP headers don't contain this information, and I've also looked at the Net::HTTP class, but I couldn't find a solution.
Can someone please help?
This will do it if there actually is a URL that can be discerned. It works like a charm on some pages and not on others; it should work on the URL I've used in my example.
require 'mechanize'
require 'pp'
agent = Mechanize.new
login_url = 'http://www.reports.rtui.com'
page = agent.get(login_url)
puts page.uri
index.html is the standard default, but if you go to google.com they don't appear to have an index page. Instead it runs more like an application, serving content as it's requested.
I'm no pro by any measure, but based on my research there doesn't seem to be one magic bullet that does what you want. At least, not one that's obvious. It really depends on the page itself.
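If a best-effort guess is good enough, one heuristic is to probe a list of common index filenames over HTTP and see which one the server answers for. This is only a sketch under my own assumptions (the filename list and the injectable check are mine, not from any library), since many servers serve the index page without ever exposing its filename:

```ruby
require 'net/http'
require 'uri'

# Common default-document names; extend as needed (this list is a guess).
COMMON_INDEX_NAMES = %w[index.html index.htm index.php default.html].freeze

# Return the first candidate filename the check succeeds for, or nil.
# The HTTP check is injectable so the lookup logic can be exercised
# without network access.
def get_indexpage_for(host, candidates: COMMON_INDEX_NAMES,
                      check: ->(url) { Net::HTTP.get_response(URI(url)).is_a?(Net::HTTPSuccess) })
  candidates.find { |name| check.call("http://#{host}/#{name}") }
end
```

A nil result doesn't mean there is no index page, only that none of the guessed filenames responded successfully.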
I have captured a Mechanize page. How can I get it into a string? Pretty Print can output the object, but I'd like to have it in a string for further processing. I can't seem to find a method for this.
Any advice appreciated.
Cheers
Never needed to save the page content to a string, but this works:
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com')

s = page.content # the text content, delegated to Nokogiri (markup stripped)
# or, for the raw HTML as a String:
s = page.body
I need to scrape some financial data from a system called NetTeller.
As you can see, it's a two-step process where you first enter an ID number, and only after submission is the user presented with a password field. I'm hitting some road bumps when it comes to jumping through these two hoops before getting into the system and to the data I actually want. How would one handle a scenario like this, where you need to get through the authentication fields before reaching the data you want to scrape?
I had assumed I could just jump in with HTTPClient and Nokogiri, but I'm curious whether there are any tricks for dealing with a two-page login like this before getting to your target.
I would use Mechanize. The first page is "tricky" because the login form is inside an iframe, so you can request the iframe's source URL directly. Here is how:
require 'mechanize'

agent = Mechanize.new

# Get first page: load the iframe's source URL directly
iframe_url = 'https://www.banksafe.com/sfonline/'
page = agent.get(iframe_url)

# Step 1: fill in the ID (the field name here is a guess; inspect the form)
login_form = page.forms.first
login_form.field_with(:name => 'id').value = '12345678'

# Step 2: submitting the ID returns the page with the password field
second_page = login_form.submit
second_login_form = second_page.forms.first
second_login_form.field_with(:name => 'password').value = 'xxxxx'

# Submitting again lands on the page to scrape
response = second_login_form.submit
That is how I would handle a scenario like this. Obviously you may need to adapt it to exactly how those forms and fields are written, and to other page-specific details, but I would go with this approach.
Is there a way to convert a Mechanize relative-link object to another one which contains the absolute URL?
Mechanize must know the absolute link, because I can call the click method on relative links too.
You can just merge the page uri (which is always absolute) with the link uri:
page.uri.merge link.uri
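To illustrate with plain Ruby URI objects (the URLs here are made up), merge resolves the relative reference against the absolute base:

```ruby
require 'uri'

page_uri = URI('https://www.example.com/articles/index.html') # page.uri is always absolute
link_uri = URI('../about/team.html')                          # link.uri may be relative

page_uri.merge(link_uri).to_s
# => "https://www.example.com/about/team.html"
```

The merge follows standard relative-reference resolution, so dot segments like `..` are collapsed for you.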
This is not specific to Mechanize, but an easy way would be to take the base URL from the <base> tag and join it with the relative URL for whatever purpose you want. This generally works.
That said, I'm not sure whether you could call the click method on the result, since I don't know Mechanize that well.
You can also use resolve. For example:
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.example.com/some/page') # any page, so the agent has a current page

some_rel_url = '/something'
url = agent.resolve(some_rel_url)
# => an absolute URI, resolved against the current page's URI
Keep in mind that the other answers provided do not take into account all of the ways the base URL can be specified (for example, a <base> element in the document's head).
I need the ability to grab reports off of a particular website. The method below does everything I need it to; the only catch is that the report, "report.csv", is served back with "content-disposition:filename=report.csv" in the response header when the page is posted (the page posts to itself).
def download_report
  page = @mechanize.click(@mechanize.current_page.link_with(:text => /Reporting/))
  page.form.field_with(:name => "rep").option_with(:value => "adperf").click
  page.form_with(:name => "get-report").field_with(:id => "sasReportingQuery.dateRange").option_with(:value => "Custom").click
  start_date = DateTime.parse(@start_date)
  end_date = DateTime.parse(@end_date)
  page.form_with(:name => "get-report").field_with(:name => "sd_display").value = start_date.strftime("%m/%d/%Y")
  page.form_with(:name => "get-report").field_with(:name => "ed_display").value = end_date.strftime("%m/%d/%Y")
  page.form_with(:name => "get-report").submit
end
As far as I can tell, Mechanize is not capturing the file anywhere that I can get to it. Is there a way to get Mechanize to capture and download this file?
@mechanize.current_page does not contain the file, and @mechanize.history does not show that the file's URL was ever presented to Mechanize.
The server appears to be telling the browser to save the document. "Content-disposition:filename" is the clue to that. Mechanize won't know what to do with that, and will try to read and parse the content, which, if it's a CSV, will not work.
Without seeing the HTML page you're working with it's impossible to know exactly what mechanism they're using to trigger the download. Clicking an element could fire a JavaScript event, which Mechanize won't handle. Or, it could send a form to the server, which responds with the document download. In either case, you have to figure out what is being sent, why, and what specifically defines the document you want, then use that information to request the document.
Mechanize isn't the right tool to download an attachment. Use Mechanize to navigate forms, then use Mechanize's embedded Nokogiri to extract the URL for the document.
Then use something like curb or Ruby's built-in OpenURI to retrieve the attachment, or see "Using WWW:Mechanize to download a file to disk without loading it all in memory first" for more information.
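For instance, once you've extracted the document's URL, OpenURI can stream it straight to disk rather than holding it in memory. This is a sketch; the URL and filename are hypothetical stand-ins for whatever you pull out of the page:

```ruby
require 'open-uri'

# Copy an already-opened IO to a file on disk in binary mode.
def save_stream(remote, path)
  File.open(path, 'wb') { |local| IO.copy_stream(remote, local) }
  path
end

# Stream the attachment without loading it all into memory first.
# The URL is hypothetical; use the href extracted via Nokogiri.
def download_attachment(url, path)
  URI.open(url) { |remote| save_stream(remote, path) }
end
```

URI.open needs Ruby 2.5+; on older Rubies use OpenURI's open, as in the question at the top of this page.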
Check the class of the returned page (page.class). If it is Mechanize::File, then you can just save it:
...
page = page.form_with(:name => "get-report").submit
page.class # => Mechanize::File, if the response wasn't parsed as HTML
page.save('path/to/file')