I accessed a page that has this link:
<a class="portletpage-portlet-title is-active" tabindex="0" title="Registration" data-ppid="registration_WAR_registration" href="#registration">Registration</a>
The page is served over SSL. The link's href attribute is #registration. I am trying to follow this link to get to the URL:
www.redacted.com/#registration
Here is my code:
agent.get('*redacted*') do |page|
  # Fill in and submit the login form
  page.form_with(:action => '*redacted*') do |f|
    f.field_with(:id => 'username').value = get_username()
    f.field_with(:id => 'password').value = get_password()
  end.click_button
end
# After logging in, follow the link by its text
agent.page.link_with(:text => 'Registration').click
When it clicks on the link, it produces the following error:
`fetch': 404 => Net::HTTPNotFound for https://*redacted*/group/1403104853945/academics?p_p_id=registration_WAR_uofsregistration&p_p_state=maximized -- unhandled response (Mechanize::ResponseCodeError)
from /home/mike/.rvm/gems/ruby-2.4.1/gems/mechanize-2.7.5/lib/mechanize.rb:464:in `get'
from /home/mike/.rvm/gems/ruby-2.4.1/gems/mechanize-2.7.5/lib/mechanize.rb:348:in `click'
from /home/mike/.rvm/gems/ruby-2.4.1/gems/mechanize-2.7.5/lib/mechanize/page/link.rb:30:in `click'
from u-of-s-scraper.rb:34:in `<main>'
and comes up with the URL:
www.redacted.com/group/1403104853945/academics?p_p_id=registration_WAR_uofsregistration&p_p_state=maximized
I'm not sure where Mechanize is getting the URL. The link has an attribute data-ppid, which appears to be contributing to the URL. Can anyone provide some insight?
It turns out that the page is written using Liferay's Portlets. Unfortunately, Portlets are not directly URL accessible, so I am currently investigating a different means of scraping the page - potentially with Selenium or PhantomJS.
data-ppid is a data attribute, which is meant to be handled by JavaScript. The change of URL is probably due to some JavaScript code on the client side (and a redirect on the server side).
Links whose href starts with # are fragment links (also called "named links" or "bookmark links"): they don't request a new page, they just jump you to a location within the current one.
In other words, there's no reason to ever "follow" a link like that with mechanize.
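To illustrate the point about # links, here is a minimal sketch using only Ruby's standard URI library (not Mechanize itself). The page URL is a made-up placeholder and resolve_href is a hypothetical helper name. It shows that resolving #registration against the current page gives the same document, and that the fragment is not part of what gets sent to the server.

```ruby
require 'uri'

# Resolve an href against the URL of the page it appears on
# (hypothetical helper, for illustration only).
def resolve_href(page_url, href)
  URI.parse(page_url).merge(href)
end

# Placeholder page URL, not the real site.
page_url = 'https://www.example.com/group/academics?tab=1'
target   = resolve_href(page_url, '#registration')

puts target              # same path and query, with the fragment appended
puts target.fragment     # the part after '#'
puts target.request_uri  # path and query actually requested (no fragment)
```

So a fragment-only link never produces a different server request on its own; anything that changes the URL beyond that would have to come from script or from a different link being matched.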
We need to retrieve the logs for our Stripe instance for a specific time period. There isn't an endpoint for this in their API (grrrr), so we are trying a quick screen scrape, because the dashboard structures them quite nicely.
At this point, though, I can't even log into Stripe using Mechanize. Below is the code I am using to log in:
require 'mechanize'
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
agent.follow_meta_refresh = true
starting_link = 'https://dashboard.stripe.com/login'
page = agent.get(starting_link)
login_form = page.form
login_form.email = email
login_form.password = pass
new_page = agent.submit(login_form, login_form.buttons[0])
The response I get from running this is:
Mechanize::ResponseCodeError: 404 => Net::HTTPNotFound for https://dashboard.stripe.com/login -- unhandled response
from /Users/Nicholas/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.4/lib/mechanize/http/agent.rb:316:in `fetch'
from /Users/Nicholas/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.4/lib/mechanize.rb:1323:in `post_form'
from /Users/Nicholas/.rvm/gems/ruby-2.2.2/gems/mechanize-2.7.4/lib/mechanize.rb:584:in `submit'
from (irb):21
from /Users/Nicholas/.rvm/rubies/ruby-2.2.2/bin/irb:11:in `<main>'
I tried logging into several other sites and was successful. I also aliased the agent and handled the re-direct (a strategy mentioned in other questions).
Does anyone know what tweaks could be made to Mechanize to log into Stripe?
Thanks much
Short Answer:
I would suggest using a browser engine like Selenium to get the logs data as that will be much simpler.
Long Answer:
Though your mechanize form submission code is correct, it is assuming the Stripe login form is being submitted using a normal POST request which is not the case.
The Stripe login form is being submitted using an AJAX request.
Here is the working code to take that into account:
require 'mechanize'
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
agent.follow_meta_refresh = true
starting_link = 'https://dashboard.stripe.com/login'
page = agent.get(starting_link)
login_form = page.form
login_form.action = 'https://dashboard.stripe.com/ajax/sessions'
login_form.email = email
login_form.password = password
new_page = agent.submit(login_form, login_form.buttons[0])
As you can see, simply setting the form's action property to the AJAX url solves your problem.
However, once you have logged in successfully, navigating around the site to scrape it for logs will not be possible with Mechanize, as it does not support JavaScript. You can check that by requesting the dashboard's URL: you will get an error message asking you to enable JavaScript.
Further, Stripe's dashboard is fully JavaScript powered. It simply makes an AJAX request to fetch data from the server and then renders it as HTML.
This can work for you as the server response is JSON. You can simply parse it and get the required information from logs.
Upon further inspection (in Chrome Developer Tools), I found that the logs are requested from the URL https://dashboard.stripe.com/ajax/logs?count=5&include%5B%5D=total_count&limit=10&method=not_get
Again, if you try to access this URL using Mechanize, you will run into a CSRF token problem, because Stripe maintains the token between requests.
The CSRF token problem can be solved using Mechanize cookies, but it will not be worth the effort.
I would suggest using a browser engine like Selenium to get the logs data, as that will be much simpler.
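For reference, the query string of the logs URL quoted above can be unpacked with Ruby's standard library, which makes it easier to see which parameters the dashboard sends. This is only a sketch: query_params is a hypothetical helper name, and nothing here contacts Stripe or says what the parameters mean on the server side.

```ruby
require 'uri'

# Split a URL's query string into [name, value] pairs, decoding
# percent-escapes such as %5B%5D ("[]") along the way.
def query_params(url)
  URI.decode_www_form(URI.parse(url).query || '')
end

logs_url = 'https://dashboard.stripe.com/ajax/logs?count=5&include%5B%5D=total_count&limit=10&method=not_get'

query_params(logs_url).each do |name, value|
  puts "#{name} = #{value}"
end
```

Changing values such as limit in a copy of these pairs and re-encoding them with URI.encode_www_form would be one way to experiment, assuming the CSRF issue mentioned above has been dealt with.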
I want to capture screenshot of the browser URL section.
browser.screenshot.save('tdbank.png')
This saves the whole page as rendered inside the browser, but I want to capture the address bar of the browser. Any suggestions?
Sometimes the URL is http and sometimes https. I want to capture this in a screenshot and archive it. I know I could get it through
url = browser.url
and then do some comparison, but I need this for legal purposes, so it has to be done by taking a screenshot.
Thanks in advance.
If you're on Windows, you could use the win32screenshot gem. For example:
require 'watir-webdriver'
require 'win32/screenshot'
b = Watir::Browser.new # using firefox as default browser
b.goto('http://www.example.org')
Win32::Screenshot::Take.of(:window, :title => /Firefox/).write("image.bmp")
b.close
I am trying to access the calendar data on an airbnb listing and so far have been unsuccessful. I am using the Mechanize gem in Ruby, and when I try to access the link to access the table, I am encountering the following error:
require 'mechanize'
agent = Mechanize.new
page1=agent.get("https://www.airbnb.com/rooms/726348")
page2=agent.get("https://www.airbnb.com/rooms/calendar_tab_inner2/73944?cal_month=11&cal_year=2013&currency=USD")
Mechanize::ResponseCodeError: 400 => Net::HTTPBadRequest for https://www.airbnb.com/rooms/calendar_tab_inner2/726348?cal_month=11&cal_year=2013&currency=USD -- unhandled response
I have also tried to click on the tab that generates the table with the following code, but doing so simply generates the html from the original url.
agent = Mechanize.new
page1=agent.get("https://www.airbnb.com/rooms/726348")
page2=agent.click(page1.link_with(:href => '#calendar'))
Any help would be greatly appreciated. Thanks!
I see the problem: that URL is requested via AJAX, so you need to send the matching request header:
page = agent.get url, nil, nil, {'X-Requested-With' => 'XMLHttpRequest'}
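For anyone who wants to see what that header looks like outside Mechanize, here is a small sketch with Ruby's standard Net::HTTP. It only builds the request object and does not send it. The URL is a placeholder, build_xhr_get is a hypothetical helper name, and whether a given site accepts the request with this header alone is an assumption to verify in your browser's network tab.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a GET request that carries the header most
# JavaScript libraries add to AJAX calls.
def build_xhr_get(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request['X-Requested-With'] = 'XMLHttpRequest'
  request
end

# Placeholder URL, for illustration only.
request = build_xhr_get('https://www.example.com/rooms/calendar?cal_month=11&cal_year=2013')

puts request.method               # HTTP verb
puts request.path                 # path plus query string
puts request['X-Requested-With']  # the extra header

# Sending it would look roughly like this (not executed here):
# Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
```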
I have this:
<a class="top_level_active" href="javascript:Submit('menu_home')">Account Summary</a>
I want to click that link, but I get an error when using link_with.
I've tried:
bot.click(page.link_with(:href => /menu_home/))
bot.click(page.link_with(:class => 'top_level_active'))
bot.click(page.link_with(:href => /Account Summary/))
The error I get is:
NoMethodError: undefined method `[]' for nil:NilClass
That's a JavaScript link. Mechanize will not be able to click it, since it does not evaluate JavaScript. Sorry!
Try to find out what happens in your browser when you click that link. Does it create a POST or a GET request? What parameters are sent to the server? Once you know that, you can emulate the same action in your Mechanize script. Chrome dev tools / Firebug will help out.
If that doesn't work, try switching to a library that supports JavaScript evaluation. I've used watir-webdriver to great success, but you could also try out phantomjs, casperjs, pjscrape, or other tools.
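As a starting point for the first suggestion, the argument passed to Submit(...) can be pulled out of the href with a regular expression. This is a sketch only: js_submit_argument is a hypothetical helper, and what the page's Submit function actually does with 'menu_home' (which form it posts, under which field name) is unknown here and has to be read from the page's script or the browser's network tab.

```ruby
# Extract the single-quoted argument from a href such as
# "javascript:Submit('menu_home')". Returns nil for any other shape.
def js_submit_argument(href)
  match = href.to_s.match(/\Ajavascript:\s*Submit\('([^']*)'\)\s*;?\z/i)
  match && match[1]
end

puts js_submit_argument("javascript:Submit('menu_home')")
puts js_submit_argument('/some/ordinary/link').inspect
```

The extracted value could then be placed into whatever GET or POST request the browser turns out to make when the link is clicked.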
The first two should have worked, so print out the hrefs to make sure the link is really there:
puts page.links.map(&:href)
Remember that just because you can see it in your browser does not mean it appears in the response. It could have been sent as an AJAX update.
Also you can just do this which I think is cleaner:
page.link_with(:href => /menu_home/).click
However I don't think clicking that link will do what you want since it's javascript.
Here's a way to handle it. Assume your page returns this content:
puts page.body
<HTML><SCRIPT LANGUAGE="JavaScript"><!--
top.location="http://www.example.com/pages/myaccount/dashboard.aspx?";
// --></SCRIPT>
<NOSCRIPT>Javascript required.</NOSCRIPT></HTML>
We know it's coming so we know what to check for:
link_search = %r{top\.location="([^"]+)"}
js_link = page.body.match(link_search)[1]
page = agent.get(js_link)
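A slightly more defensive variant of the lookup above, as a sketch: it returns nil instead of raising NoMethodError when the page does not contain the redirect. js_redirect_target is a hypothetical helper name, and the sample body simply repeats the example page shown earlier.

```ruby
# Return the URL assigned to top.location in an inline script,
# or nil when no such assignment is present.
def js_redirect_target(body)
  match = body.to_s.match(/top\.location\s*=\s*"([^"]+)"/)
  match && match[1]
end

body = <<~HTML
  <HTML><SCRIPT LANGUAGE="JavaScript"><!--
  top.location="http://www.example.com/pages/myaccount/dashboard.aspx?";
  // --></SCRIPT>
  <NOSCRIPT>Javascript required.</NOSCRIPT></HTML>
HTML

puts js_redirect_target(body)
puts js_redirect_target('<html></html>').inspect
```

With Mechanize this would be used as target = js_redirect_target(page.body), followed by agent.get(target) only when target is not nil.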
I am experimenting with using rspec and Watir to do some TDD and have come across a problem I can't seem to get past. I want to have Watir click a link (target="_blank") and then get the URL of the newly loaded page. Watir clicks the link, but when I attempt to get the URL I receive the old URL, not the current one. The Watir docs seem to indicate that the Browser url method will return the current URL. I found a blog post that seems to solve this issue by having Watir execute some JavaScript to get the current URL, but this isn't working for me. Is there any way to get the current URL after a link click with Watir?
<!-- the html -->
<a href="http://www.linkedin.com" target="_blank">LinkedIn</a>
# The rspec code
it "should load LinkedIn" do
  browser.link(:href => "http://www.linkedin.com").click
  browser.url.should == "http://www.linkedin.com"
end
The target will load the link in a new browser window, therefore you need to switch to that window to assert the url:
it "should load LinkedIn" do
  browser.link(:href => "http://www.linkedin.com").click
  # Switch to the newly opened window before asserting on its URL
  browser.window(:title => /.*LinkedIn.*/).use do
    browser.url.should == "http://www.linkedin.com"
  end
end
See: http://watirwebdriver.com/browser-popups/ for more examples