Scraping an AngularJS application - ruby

I'm scraping some HTML pages with Rails, using Nokogiri.
I had some problems when I tried to scrape an AngularJS page, because the gem is parsing the HTML before it has been fully rendered.
Is there some way to scrape this type of page? How can I have the page fully rendered before scraping it?

If you're trying to scrape AngularJS pages in a fully generic fashion, then you're likely going to need something like what #tadman mentioned in the comments (PhantomJS) -- some type of headless browser that fully processes the AngularJS JavaScript and opens the DOM up to inspection afterwards.
If you have a specific site or sites that you are looking to scrape, the path of least resistance is likely to avoid the AngularJS frontend entirely and query the API from which the Angular code is pulling its content. The standard scenario for most AngularJS sites is that they pull down the static JS and HTML code/templates, and then make AJAX calls back to a server (either their own, or some third-party API) to get the content that will be rendered. If you take a look at their code, you can likely query directly whatever Angular is calling (e.g. via $http, ngResource, or restangular). The returned data is typically JSON and is much easier to work with than scraping the post-rendered HTML.
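As a concrete illustration of that second approach, here is a minimal sketch that hits such a backend endpoint directly. The URL and field names are hypothetical placeholders you would discover by watching the browser's network tab while the Angular app loads:

require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoint; find the real one in the browser's network tab.
uri = URI('https://example.com/api/items?page=1')
response = Net::HTTP.get_response(uri)
items = JSON.parse(response.body)          # the Angular frontend usually receives plain JSON
items.each { |item| puts item['title'] }   # 'title' is an assumed field name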

You can use:
require 'phantomjs'
require 'watir'
require 'nokogiri'

b = Watir::Browser.new(:phantomjs)   # drive a headless PhantomJS instance via Watir
b.goto URL                           # URL is the page you want rendered
doc = Nokogiri::HTML(b.html)         # parse the DOM after JavaScript has run
b.close
Download PhantomJS from http://phantomjs.org/download.html and move the binary to /usr/bin.

Related

Mechanize can't find form

I'm having a problem accessing a form element on a page I'm retrieving with Mechanize.
username_page = agent.get 'https://member.carefirst.com/mos/#/home'
username_form = username_page.form_with(name: 'soloLoginForm')
username_form is nil, even though username_page does hold the page. The page definitely has a form whose selector is #soloLoginForm, but username_page.body contains no form element.
I'm guessing this is some async or dynamic issue. I'm able to grab the form with poltergeist, and I'm looking into doing all my form filling with capybara/poltergeist, but I wonder if there's something simple I'm missing that will allow me to use mechanize, as I'd planned.
It seems that 'https://member.carefirst.com/mos/#/home' uses Angular to render elements of the page. AngularJS requires JavaScript support in the browser, so in your case Capybara needs a driver with JavaScript support.
Mechanize doesn't support JavaScript (see this old SO thread), which is probably why it works when you try with Poltergeist.
Check: https://github.com/teamcapybara/capybara#drivers
As stated in the answer by #hernanvicente, the page is using Angular and requires JavaScript (which Mechanize does not support). However, you really want to be using Selenium with headless Chrome rather than Poltergeist nowadays. Poltergeist is equivalent to roughly a seven-year-old version of Safari (because PhantomJS, which it uses for rendering, has been abandoned), so it doesn't support a lot of the JS and CSS used in modern sites. Another advantage of using Selenium with Chrome is that you can easily swap between headless and headed modes when you need to see what's happening while debugging. A sketch of the Capybara setup follows.
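A minimal sketch of that setup, assuming the capybara and selenium-webdriver gems plus a local Chrome/chromedriver install (the implicit wait on a form element is just one way to let Angular finish rendering):

require 'capybara'
require 'selenium-webdriver'

# Register a headless Chrome driver for Capybara.
Capybara.register_driver :headless_chrome do |app|
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
end

session = Capybara::Session.new(:headless_chrome)
session.visit 'https://member.carefirst.com/mos/#/home'
session.has_css?('form')   # waits (up to Capybara's default timeout) for Angular to render a form
puts session.html          # the DOM after JavaScript has run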

crawling dynamic content using ruby

I am using Ruby gems (Nokogiri & Mechanize) to build a crawler for a website, but this website contains Bootstrap modals (popup windows) that are generated dynamically on button click.
The modal content is loaded on button click via a GET request to some URL.
I am crawling the URL associated with the button, but I just get the same page source back.
How can I get that dynamic content using Ruby?
The modal you're describing is, with high probability, rendered with JavaScript. So what you're looking for is not possible with the libraries you mention, because they do not execute JavaScript.
In order to parse pages whose content depends on JavaScript, you should use other tools, e.g. puppeteer, or a headless Chrome driver for Ruby such as Ferrum (sketched below).
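A rough sketch with the Ferrum gem (a Ruby driver for headless Chrome, similar in spirit to puppeteer); the URL and selector are hypothetical placeholders for your site:

require 'ferrum'

browser = Ferrum::Browser.new
browser.go_to('https://example.com/catalog')   # hypothetical page containing the modal button
browser.at_css('#show-details-button').click   # triggers the AJAX call behind the modal
browser.network.wait_for_idle                  # wait for the modal's request to finish
puts browser.body                              # rendered HTML, modal content included
browser.quit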

Why is my ajax content not being indexed by google

I have tried to set my site up (http://www.diablo3values.com) according to the guidelines set out here: https://developers.google.com/webmasters/ajax-crawling/ However, it appears that Google has updated their indexes (because I see the revisions to the meta description tags), but the AJAX content does not show up in the index.
I am trying to use the “Handle pages without hash fragments” option.
If you view either of the following:
http://www.diablo3values.com/?_escaped_fragment_=
http://www.diablo3values.com/about?_escaped_fragment_=
you will correctly see the HTML snapshot with my content. (Those are the two pages I am most concerned about.)
Any ideas? Am I doing something wrong? How do you get Google to correctly recognize the tag?
I'm typing this as an answer, since it got a little too long to be a comment.
First of all, your links seem to point to localhost:8080/about, and not /about, which is probably why Google doesn't index them in the first place.
Second, here's my experience with pushstate urls and Google AJAX crawling:
My experience is that AJAX crawling with pushState URLs is handled a little differently by Google than with hashbang URLs. Since Google won't know that your URL is a pushState URL (it looks just like a regular URL), you need to add <meta name="fragment" content="!"> to all your pages, not only the "root" page. And Google doesn't seem to know that the pages are part of the same application, so it treats every page as a separate AJAX application. So the Googlebot will never actually create a navigation structure inside _escaped_fragment_, like _escaped_fragment_=/about, as it would with a hashbang URL (#!/about). Instead, it will request /about?_escaped_fragment_= (which you apparently already have set up). This goes for all your "deep links": instead of /?_escaped_fragment_=/thelink, Google will always request /thelink?_escaped_fragment_=.
But as said initially, the reason it doesn't work for you is probably that you have localhost:8080 URLs in the HTML generated for your _escaped_fragment_ requests.
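To make the server-side mechanics concrete, here is a hypothetical sketch (using Sinatra purely for illustration; the snapshots/ directory and route are assumptions, not your actual setup) of how a Ruby backend could answer those ?_escaped_fragment_= requests with a prerendered snapshot:

require 'sinatra'

get '/about' do
  if params.key?('_escaped_fragment_')
    # Googlebot asked for the snapshot version of this pushState URL.
    send_file File.join(settings.root, 'snapshots', 'about.html')
  else
    # Regular visitors get the JavaScript-driven single-page app.
    send_file File.join(settings.root, 'public', 'index.html')
  end
end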
Googlebot only knows to crawl the escaped fragment if your urls conform to the hash bang standard. As users navigate your site, your urls need to be:
http://www.diablo3values.com/
http://www.diablo3values.com/#!contact
http://www.diablo3values.com/#!about
Googlebot actually needs to see these urls in the source code so that it can follow them. Then it knows to download the following urls:
http://www.diablo3values.com/?_escaped_fragment_=contact
http://www.diablo3values.com/?_escaped_fragment_=about
On your site you appear to be loading a new page on each click, and then loading the content of each page via AJAX too. This is not how I would expect an AJAX site to work. Usually the purpose of using AJAX is so that the user never has to load a whole new page. When the user clicks, the new content section is loaded and inserted into the page. You serve the navigation once and then you only serve escaped fragments of the content.

How to get the raw HTML source code for a page by using Ruby or Nokogiri?

I'm using Nokogiri (a Ruby HTML/XML library with XPath support) to extract content from web pages. I ran into problems with some pages, such as Ajax-driven ones: when I view the source code, I don't see the actual contents, such as <table> elements.
How can I get the HTML code for the actual content?
Don't use Nokogiri at all if you want the raw source of a web page. Just fetch the page directly as a string; there is no need to feed it to Nokogiri. For example:
require 'open-uri'

html = URI.open('http://phrogz.net').read   # fetch the raw markup as a string
puts html.length #=> 8461
puts html        #=> ...raw source of the page...
If, on the other hand, you want the post-JavaScript-modified contents of a page (such as an AJAX library that executes JavaScript code to fetch new content and change the page), then you can't use Nokogiri. You need to use Ruby to control a web browser (e.g. read up on Selenium or Watir).
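A minimal sketch of that browser-driven approach with Watir and headless Chrome (assuming the watir and nokogiri gems plus a local Chrome/chromedriver install; the URL is a placeholder):

require 'watir'
require 'nokogiri'

browser = Watir::Browser.new(:chrome, options: { args: ['--headless'] })
browser.goto 'http://example.com/some-ajax-page'   # placeholder URL
doc = Nokogiri::HTML(browser.html)                 # the DOM after JavaScript has run
puts doc.at_css('table')                           # JS-inserted elements are now present
browser.close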

How do I fetch AJAX-loaded content from another site using Nokogiri?

I was trying to parse some HTML content from a site. Nokogiri works perfectly for content loaded the first time.
Now the issue is how to fetch that content which is loaded using AJAX. For example, there is a "see more" link and more items are fetched using AJAX, or consider a case for AJAX-based tabs.
How can I fetch that content?
You won't be able to use Nokogiri to parse anything that requires a JavaScript runtime to produce its content. Nokogiri is an HTML/XML parser, not a web browser.
PhantomJS on the other hand is a web browser, albeit a special kind of browser ;) Take a look at that and have a play.
It isn't completely clear what you want to do, but if you are trying to get access to additional HTML that is loaded by AJAX, then you will need to study the code, figure out what URL is being used for the AJAX request, whether any session IDs or cookies have been set, then create a new URL that reproduces what AJAX is using. Request that, and you should get the new content back.
That can be difficult to do, though. As #Nuby said, Mechanize can be a good help, as it is designed to manage cookies and sessions for you in the background. Mechanize uses Nokogiri internally, so if you request a page with Mechanize, you can run Nokogiri searches against it to drill down and extract any particular JavaScript strings. They'll be present as text, so you can use regexes or substring matches to get at the particular parameters you need, then construct the new URL and ask Mechanize to get it. A rough sketch of that approach follows.
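A rough sketch, with a hypothetical page URL and script pattern (the /ajax/ path and the regex are assumptions that depend entirely on the site you are working with):

require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/items')   # hypothetical starting page

# Look through inline <script> text for the URL the AJAX call uses.
script_text = page.search('script').map(&:text).join("\n")
endpoint = script_text[%r{['"](/ajax/[^'"]+)['"]}, 1]   # assumed pattern

if endpoint
  fragment = agent.get(endpoint)   # Mechanize reuses the cookies/session from the first request
  puts fragment.body
end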
