Mechanize cannot load a page properly - ruby

I want to scrape some pages of this site: Marketbook.ca
So I used Mechanize for that, but it does not load the pages properly and returns a page with an empty body, as in the following code:
require 'mechanize'
agent = Mechanize.new
agent.user_agent_alias = 'Linux Firefox'
agent.get('http://www.marketbook.ca/list/list.aspx?ETID=1&catid=1001&LP=MAT&units=imperial')
What could be the issue here?

Actually this page requires a JavaScript engine to display its content:
<noscript>Please enable JavaScript to view the page content.</noscript>
Mechanize doesn't execute JavaScript, so you'd better choose another option such as Selenium or Watir. Both drive a real web browser.
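For example, a minimal Watir sketch (assuming a Chrome/chromedriver setup; the parsing step is only illustrative, since the actual Marketbook markup isn't shown here):
require 'watir'     # needs a browser driver such as chromedriver on your PATH
require 'nokogiri'

browser = Watir::Browser.new :chrome, headless: true
browser.goto 'http://www.marketbook.ca/list/list.aspx?ETID=1&catid=1001&LP=MAT&units=imperial'

# Once the JavaScript has run, hand the rendered HTML to Nokogiri for parsing.
doc = Nokogiri::HTML(browser.html)
puts doc.title

browser.close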
Another option is to look through the included JS scripts, figure out where the data comes from, and query that web resource directly if possible.

Related

Web-scraping with Ruby Mechanize

I have tried scraping a web page with Ruby Mechanize but it is not working. Basically that website has some products, and I need the links of the products.
I tested the code below and expected links to the products, but the output doesn't show anything.
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://shopee.com.br/Casa-e-Decora%C3%A7%C3%A3o-cat.11059983/')
page.css('col-xs-2-4 shopee-search-item-result__item').each do |product|
  puts product.link.text
end
Part of the page you are trying to parse is rendered client side, so when Mechanize fetches the HTML it does not contain the links you are looking for.
Luckily for you, the website uses a JSON API, so it is pretty easy to extract the product information.
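As a rough sketch (the endpoint and field names below are assumptions, not the documented API; find the real request by watching the XHR calls in your browser's network tab while the category page loads):
require 'mechanize'
require 'json'

agent = Mechanize.new
# Hypothetical endpoint -- replace it with the request the page actually makes.
url = 'https://shopee.com.br/api/v2/search_items/?match_id=11059983&limit=50&newest=0'
json = JSON.parse(agent.get(url).body)

json['items'].to_a.each do |item|
  # Field names are illustrative; inspect the real JSON response first.
  puts "https://shopee.com.br/product/#{item['shopid']}/#{item['itemid']}"
end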

Locating a form with Mechanize in ruby

require 'rubygems'
require 'nokogiri'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://www.instagram.com/accounts/login/')
form = page.forms.first
pp form
I am trying to locate the form to login to the instagram website. I cannot seem to get mechanize to locate the form even though it should be the only one on the page. When I pretty print the page I get back blank output.
This page uses Javascript to render the form, which mechanize doesn't run. If you want to see what a page looks like without Javascript, you can open it with the lynx browser.
Selenium can be used instead. After installing a driver such as chromedriver, the API is pretty similar:
require 'selenium-webdriver'

driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://www.instagram.com/accounts/login/"
first_form = driver.find_elements(css: "form")[0]
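From there you can fill in the fields and submit. The username and password field names below are assumptions about Instagram's markup, so verify them against the rendered page:
# Assumed field names -- check the actual form inputs in the browser first.
driver.find_element(name: 'username').send_keys ENV['IG_USER']
driver.find_element(name: 'password').send_keys ENV['IG_PASS']
first_form.submit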

Scraping a Dynamic Page Using Mechanize and Ruby

I'm trying to load the following page through mechanize:
http://www.amazon.com/dp/B014R6MVH2
The Product Description div (div id="productDescription") seems to be a JavaScript-driven section, and as such, is unavailable to Mechanize.
Is there any solution to this? Maybe a gem I could use to execute the javascript and see the section?
Another option could be to use a headless browser. I've tried selenium, but it's much, much slower than mechanize.
It works for me:
require 'mechanize'

agent = Mechanize.new
page = agent.get 'http://www.amazon.com/dp/B014R6MVH2'
page.at('#productDescription .content').text
#=> Description This item is a simple and useful wedding banner....

Scraping an AngularJS application

I'm scraping some HTML pages with Rails, using Nokogiri.
I had some problems when I tried to scrape an AngularJS page because the gem is reading the HTML before it has been fully rendered.
Is there some way to scrape this type of page? How can I have the page fully rendered before scraping it?
If you're trying to scrape AngularJS pages in a fully generic fashion, then you're likely going to need something like what @tadman mentioned in the comments (PhantomJS) -- some type of headless browser that fully processes the AngularJS JavaScript and opens the DOM up to inspection afterwards.
If you have a specific site or sites that you are looking to scrape, the path of least resistance is likely to avoid the AngularJS frontend entirely and directly query the API from which the Angular code is pulling content. The standard scenario for most AngularJS sites is that they pull down the static JS and HTML code/templates, and then make Ajax calls back to a server (either their own or some third-party API) to get the content that will be rendered. If you take a look at their code, you can likely directly query whatever Angular is calling (i.e. via $http, ngResource, or restangular). The returned data is typically JSON and would be much easier to gather than true scraping of the post-rendered HTML.
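As a minimal sketch (the URL and field names are placeholders; find the real endpoint by watching the XHR requests in your browser's developer tools), fetching such an API directly from Ruby might look like:
require 'net/http'
require 'json'
require 'uri'

# Placeholder URL -- substitute the endpoint the Angular app actually calls.
uri = URI('https://example.com/api/v1/items?page=1')
response = Net::HTTP.get_response(uri)

data = JSON.parse(response.body)
data['items'].to_a.each { |item| puts item['name'] }  # field names are illustrative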
You can use:
require 'phantomjs'
require 'watir'
require 'nokogiri'

b = Watir::Browser.new(:phantomjs)
b.goto URL
doc = Nokogiri::HTML(b.html)
Download PhantomJS from http://phantomjs.org/download.html and move the binary to /usr/bin.

Using a Ruby script to login to a website via https

Alright, so here's the dealio: I'm working on a Ruby app that'll take data from a website, and aggregate that data into an XML file.
The website I need to take data from does not have any APIs I can make use of, so the only thing I can think of is to login to the website, sequentially load the pages that have the data I need (in this case, PMs; I want to archive them), and then parse the returned HTML.
The problem, though, is that I don't know of any way to programmatically simulate a login session.
Would anyone have any advice, or know of any proven methods that I could use to successfully log in to an https page, and then programmatically load pages from the site using a temporary cookie session from the login? It doesn't have to be a Ruby-only solution -- I just wanna know how I can actually do this. And if it helps, the website in question is one that uses Microsoft's .NET Passport service as its login/session mechanism.
Any input on the matter is welcome. Thanks.
Mechanize
Mechanize is a Ruby library which imitates the behaviour of a web browser. You can click links, fill out forms and submit them. It even has a history and remembers cookies. It seems your problem could be easily solved with the help of Mechanize.
The following example is taken from http://docs.seattlerb.org/mechanize/EXAMPLES_rdoc.html:
require 'rubygems'
require 'mechanize'

a = Mechanize.new
a.get('http://rubyforge.org/') do |page|
  # Click the login link
  login_page = a.click(page.link_with(:text => /Log In/))

  # Submit the login form
  my_page = login_page.form_with(:action => '/account/login.php') do |f|
    f.form_loginname = ARGV[0]
    f.form_pw = ARGV[1]
  end.click_button

  my_page.links.each do |link|
    text = link.text.strip
    next unless text.length > 0
    puts text
  end
end
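Since you want to reuse the session afterwards, note that Mechanize keeps cookies in agent.cookie_jar and can persist them between runs. A rough sketch (the file name and URL are placeholders, and the exact save/load method names can differ slightly between Mechanize versions, so check the docs for yours):
require 'mechanize'

a = Mechanize.new
# ... log in as in the example above ...

# Persist the session cookies so a later run can reuse the login.
a.cookie_jar.save('cookies.yml')

# In a later run, load them back before requesting protected pages.
b = Mechanize.new
b.cookie_jar.load('cookies.yml')
pm_page = b.get('https://example.com/private-messages')  # placeholder URL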
You can try using wget to fetch the pages. You can analyse the login process with this proxy tool: www.portswigger.net/proxy/.
For what it's worth, you could check out Webrat. It is meant to be used as a tool for automated acceptance tests, but I think you could use it to simulate filling out the login fields, then click through links by their names, and grab the needed HTML as a string. Haven't tried doing anything like it, though.
