Web-scraping with Ruby Mechanize

I have tried scraping a web page with Ruby Mechanize but it is not working. Basically, that website lists some products and I need the links to the products.
I tested the code below and expected the links to the products, but the output doesn't show anything.
```ruby
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://shopee.com.br/Casa-e-Decora%C3%A7%C3%A3o-cat.11059983/')
page.css('col-xs-2-4 shopee-search-item-result__item').each do |product|
  puts product.link.text
end
```

Part of the page you are trying to parse is rendered client side, so the HTML that Mechanize receives does not contain the links you are looking for.
Luckily for you, the website populates that listing from a JSON API, so it is fairly easy to extract the product information.
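A sketch of that approach is below. The endpoint, its query parameters, and the response fields are assumptions inferred from the category id in the original URL and may have changed on Shopee's side, so treat this as a starting point rather than a working recipe:

```ruby
require 'mechanize'
require 'json'

agent = Mechanize.new

# Assumed endpoint and parameters: match_id is the category id taken from the original URL.
url = 'https://shopee.com.br/api/v2/search_items/' \
      '?by=relevancy&limit=50&match_id=11059983&newest=0&order=desc&page_type=search'
data = JSON.parse(agent.get(url).body)

(data['items'] || []).each do |item|
  # Assumed product URL format: "<name-slug>-i.<shopid>.<itemid>"
  slug = item['name'].to_s.strip.gsub(/\s+/, '-')
  puts "https://shopee.com.br/#{slug}-i.#{item['shopid']}.#{item['itemid']}"
end
```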

Related

Scraping a Dynamic Page Using Mechanize and Ruby

I'm trying to load the following page through mechanize:
http://www.amazon.com/dp/B014R6MVH2
The Product Description div (div id="productDescription") seems to be a JavaScript-driven section, and as such, is unavailable to Mechanize.
Is there any solution to this? Maybe a gem I could use to execute the javascript and see the section?
Another option could be to use a headless browser. I've tried selenium, but it's much, much slower than mechanize.
It works for me:
agent = Mechanize.new
page = agent.get 'http://www.amazon.com/dp/B014R6MVH2'
page.at('#productDescription .content').text
#=> Description This item is a simple and useful wedding banner....
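For completeness, a self-contained version of that answer; the user-agent alias is only a precaution (Amazon sometimes blocks unfamiliar clients) and is not something the original answer required:

```ruby
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Linux Firefox'   # precaution only, not part of the original answer

page = agent.get('http://www.amazon.com/dp/B014R6MVH2')
desc = page.at('#productDescription .content')
puts desc ? desc.text.strip : 'description block not found'
```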

How to parse websites that use angularjs?

I want to know how to parse a website that uses AngularJS as its front-end framework.
The following code parses http://www.pluralsight.com/courses/using-stackoverflow-stackexchange-sites to get the course title.
What I got is {{course.title}} instead of the actual course title. Can anyone give me some suggestions?
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.pluralsight.com/courses/using-stackoverflow-stackexchange-sites"))
title = doc.css("h1").first.text
puts title # => {{course.title}}
Google has good docs on how to set up SEO for AJAX-driven sites, and the site in question has followed those guidelines.
Using the <base> tag of that page as a path reference, you can access the rendered HTML at this path:
http://www.pluralsight.com/courses?_escaped_fragment_=/using-stackoverflow-stackexchange-sites
Reference: Google Ajax Crawling Spec
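A minimal sketch using that pre-rendered URL (note that Google has since deprecated the AJAX crawling scheme, so this only works as long as the site keeps supporting it):

```ruby
require 'nokogiri'
require 'open-uri'

# The _escaped_fragment_ URL serves the server-rendered version of the page
url = 'http://www.pluralsight.com/courses?_escaped_fragment_=/using-stackoverflow-stackexchange-sites'
doc = Nokogiri::HTML(open(url))
puts doc.css('h1').first.text   # actual course title instead of {{course.title}}
```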
As an alternative, you can use a headless browser to render the page and use that as your source.
You can use:
require 'phantomjs'
require 'watir'
require 'nokogiri'
b = Watir::Browser.new(:phantomjs)
b.goto 'http://www.pluralsight.com/courses/using-stackoverflow-stackexchange-sites'
doc = Nokogiri::HTML(b.html)
title = doc.css('h1').first.text
Download PhantomJS from http://phantomjs.org/download.html and move the binary to /usr/bin.

Mechanize cannot load a page properly

I want to scrape some pages of this site: Marketbook.ca
So I used Mechanize for that, but it does not load the pages properly and returns a page with an empty body, as with the following code:
require 'mechanize'
agent = Mechanize.new
agent.user_agent_alias = 'Linux Firefox'
agent.get('http://www.marketbook.ca/list/list.aspx?ETID=1&catid=1001&LP=MAT&units=imperial')
What could be the issue here?
Actually, this page requires a JS engine to display its content:
<noscript>Please enable JavaScript to view the page content.</noscript>
Mechanize doesn't handle pages that rely on JS, so you'd be better off choosing another option such as Selenium or Watir; both drive a real web browser.
Another option is to look through the included JS scripts, figure out where the data comes from, and query that web resource directly if possible.
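A minimal sketch of the browser-driven approach, assuming Watir with Chrome; the wait condition and the link extraction below are placeholders to be adapted to the actual markup of the rendered page:

```ruby
require 'watir'
require 'nokogiri'

browser = Watir::Browser.new(:chrome)
browser.goto 'http://www.marketbook.ca/list/list.aspx?ETID=1&catid=1001&LP=MAT&units=imperial'
browser.wait_until { |b| b.links.count > 0 }   # wait until the JS has rendered something

doc = Nokogiri::HTML(browser.html)             # hand the rendered HTML to Nokogiri
puts doc.css('a').map { |a| a['href'] }.compact.uniq.take(10)

browser.close
```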

How to get the raw HTML source code for a page by using Ruby or Nokogiri?

I'm using Nokogiri (a Ruby XML/HTML parsing library with XPath support) to grab content from web pages. Then I found problems with some web pages, such as AJAX-driven ones: when I view the source code I don't see the actual content, such as the <table> elements, etc.
How can I get the HTML code for the actual content?
Don't use Nokogiri at all if you want the raw source of a web page; just fetch the page directly as a string. For example:
require 'open-uri'
html = open('http://phrogz.net').read
puts html.length #=> 8461
puts html #=> ...raw source of the page...
If, on the other hand, you want the post-JavaScript contents of a page (for example, when JavaScript fetches new content and modifies the DOM), then Nokogiri alone won't help. You need to use Ruby to control a web browser (e.g. read up on Selenium or Watir).

Using a Ruby script to login to a website via https

Alright, so here's the dealio: I'm working on a Ruby app that'll take data from a website, and aggregate that data into an XML file.
The website I need to take data from does not have any APIs I can make use of, so the only thing I can think of is to log in to the website, sequentially load the pages that have the data I need (in this case, PMs; I want to archive them), and then parse the returned HTML.
The problem, though, is that I don't know of any way to programmatically simulate a login session.
Would anyone have any advice, or know of any proven methods that I could use to successfully log in to an HTTPS page, and then programmatically load pages from the site using the temporary cookie session from the login? It doesn't have to be a Ruby-only solution -- I just wanna know how I can actually do this. And if it helps, the website in question is one that uses Microsoft's .NET Passport service as its login/session mechanism.
Any input on the matter is welcome. Thanks.
Mechanize
Mechanize is a Ruby library which imitates the behaviour of a web browser. You can click links, fill out forms and submit them. It even has a history and remembers cookies. It seems your problem could be easily solved with the help of Mechanize.
The following example is taken from http://docs.seattlerb.org/mechanize/EXAMPLES_rdoc.html:
require 'rubygems'
require 'mechanize'

a = Mechanize.new

a.get('http://rubyforge.org/') do |page|
  # Click the login link
  login_page = a.click(page.link_with(:text => /Log In/))

  # Submit the login form
  my_page = login_page.form_with(:action => '/account/login.php') do |f|
    f.form_loginname = ARGV[0]
    f.form_pw        = ARGV[1]
  end.click_button

  my_page.links.each do |link|
    text = link.text.strip
    next unless text.length > 0
    puts text
  end
end
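Because the Mechanize agent keeps its cookie jar between requests, anything fetched through the same agent after the login step happens inside the authenticated session, over HTTPS as well. A hypothetical continuation of the example above (the path is illustrative only):

```ruby
# Hypothetical follow-up request: the same agent reuses the session cookies
# set during the login above, so this page is fetched while logged in.
account_page = a.get('https://rubyforge.org/account/')   # illustrative URL
puts account_page.title
```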
You can try using wget to fetch the pages. You can analyse the login process with this proxy tool: www.portswigger.net/proxy/.
For what it's worth, you could check out Webrat. It is meant to be used as a tool for automated acceptance tests, but I think you could use it to simulate filling out the login fields, click through links by their names, and grab the needed HTML as a string. I haven't tried doing anything like that, though.
