Crawling dynamic content using Ruby

I am using Ruby gems (Nokogiri and Mechanize) to build a crawler for a website, but this website contains Bootstrap modals (popup windows) that are generated dynamically on a button click.
The modal's content appears on a button click that issues a GET request to some URL.
I tried crawling the URL associated with the button, but I just get the same page source back.
How can I get the content of that dynamically loaded modal using Ruby?

The modal you're describing is, with high probability, rendered with JavaScript. So what you're looking for is not possible with these libraries, because neither Nokogiri nor Mechanize executes JavaScript.
To parse pages whose content depends on JavaScript, you should use other tools, e.g. Puppeteer.
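That said, since the question notes the modal is fetched via a plain GET, you can sometimes bypass the browser and request that endpoint directly. A minimal Mechanize sketch, assuming a placeholder URL and assuming the server distinguishes AJAX-style requests via the X-Requested-With header:

require 'mechanize'

agent = Mechanize.new
# Placeholder URL: substitute whatever endpoint the button's GET actually hits.
# Some servers only return the modal fragment when the request looks like an
# AJAX call, hence the X-Requested-With header.
fragment = agent.get('https://example.com/modal/details?id=1',
                     [], nil,
                     'X-Requested-With' => 'XMLHttpRequest')
puts fragment.body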

Related

How to click an element on the page without loading it with capybara?

Is it possible to click on an element without loading the page, using Capybara and Ruby?
No, that is not possible. The browser needs to load the page in order to know what to do when the element is clicked.
If the element being clicked just triggers a web request, and you know what that request is, then you could make it directly using any of the Ruby networking libraries; you may, however, run into issues with the referrer, cookies, etc.
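For example, here is a minimal Net::HTTP sketch along those lines; the endpoint is hypothetical, and the header values are placeholders you would copy from a real browser session:

require 'net/http'
require 'uri'

# Hypothetical endpoint: whatever request the click would have triggered.
uri = URI('https://example.com/api/expand?id=42')

req = Net::HTTP::Get.new(uri)
# Some servers check these; copy the values a real browser session sends.
req['Referer'] = 'https://example.com/list'
req['Cookie']  = 'session=abc123' # placeholder session cookie

res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
puts res.body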

How can I use mechanize to click button to webscrape a page to get information?

I'm looking to scrape the contents of a page that requires you to press an arrow button, after which information appears via jQuery rather than by loading a new page. Since there needs to be a button click, I'm using Mechanize for this part instead of Nokogiri. What I have so far is
url = "http://brokercheck.finra.org/Individual/Summary/1327992"
mechanize = Mechanize.new
page = mechanize.get(url)
button = page.at('.ArrowExpandDsclsr.faangledown')
new_page = mechanize.click(button)
new_page.at('#disclosuredetails')
It appears that new_page still doesn't show the page with the newly loaded information. Anyone know why that is?
The button you're trying to get Mechanize to click is not a "regular" button; it's a bit more dynamic. It uses JavaScript/AJAX to fetch the relevant data when clicked.
Mechanize doesn't render the DOM of a web page, nor does it provide a way for JavaScript to interact with the page. Thus it's not suitable for interacting with dynamic pages that depend on JavaScript for their functionality.
For such cases, I'd suggest PhantomJS, either standalone or through Capybara/Poltergeist if you'd prefer interacting with it via Ruby; a rough sketch of that route follows.
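A rough sketch of the Capybara + Poltergeist route, assuming the poltergeist gem is installed and PhantomJS is on the PATH; the URL and selectors are taken from the question and may need adjusting:

require 'capybara'
require 'capybara/poltergeist'
require 'nokogiri'

Capybara.register_driver :poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app, js_errors: false)
end

session = Capybara::Session.new(:poltergeist)
session.visit 'http://brokercheck.finra.org/Individual/Summary/1327992'
session.find('.ArrowExpandDsclsr.faangledown').click
# Capybara's find waits for the ajax-loaded node to appear before returning.
session.find('#disclosuredetails')
doc = Nokogiri::HTML(session.html)
puts doc.at('#disclosuredetails')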

Scraping an AngularJS application

I'm scraping some HTML pages with Rails, using Nokogiri.
I ran into problems when I tried to scrape an AngularJS page, because the gem opens the HTML before it has been fully rendered.
Is there some way to scrape this type of page? How can I have the page fully rendered before scraping it?
If you're trying to scrape AngularJS pages in a fully generic fashion, then you're likely going to need something like what @tadman mentioned in the comments (PhantomJS): some type of headless browser that fully processes the AngularJS JavaScript and opens the DOM up to inspection afterwards.
If you have a specific site or sites that you are looking to scrape, the path of least resistance is likely to avoid the AngularJS frontend entirely and directly query the API from which the Angular code is pulling content. The standard scenario for many/most AngularJS sites is that they pull down static JS and HTML code/templates, then make AJAX calls back to a server (either their own, or some third-party API) to get the content that will be rendered. If you take a look at their code, you can likely directly query whatever Angular is calling (i.e. via $http, ngResource, or Restangular). The returned data is typically JSON and would be much easier to gather than true scraping of the post-rendered HTML result; a sketch of that approach follows.
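A minimal sketch of querying such an API directly; the endpoint here is hypothetical, and in practice you would find the real one in the browser's network tab, where the app's $http calls show up as XHR requests:

require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoint: replace with the XHR URL the Angular app calls.
uri = URI('https://example.com/api/users?page=1')
res = Net::HTTP.get_response(uri)

# Assumes the endpoint returns a JSON array of user objects.
JSON.parse(res.body).each { |user| puts user['name'] }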
You can use:
require 'phantomjs'
require 'watir'
require 'nokogiri'

# Drive PhantomJS through Watir, let it render the page, then hand the
# resulting HTML to Nokogiri for parsing.
b = Watir::Browser.new(:phantomjs)
b.goto URL
doc = Nokogiri::HTML(b.html)
Download PhantomJS from http://phantomjs.org/download.html and move the binary to /usr/bin.

How to view JS code loaded with AJAX in browser?

I have a JSP page, where some parts of the pages are loaded from the backend using AJAX. For example, when I first open the page, the URL is http://www.made-up-domain-name-because-of-stack-overflow-restrictions.com/listUsers.do. The page contains an "add user" button, which loads HTML content (containing a form etc.) from the backend to the div-element with id "addArea". The URL stays the same the whole time (naturally), as the request is done in the background.
The problem I have is that the content loaded via AJAX is not fully viewable by any means I've tried.
In Firefox I can see the new HTML with the Firebug add-on and "Inspect element", but the content within the script tags is not visible that way (nor in the "Script" tab in Firebug; only the originally loaded scripts appear there). If I use "View page source" in Firefox, a page reload is executed and I don't see the newly generated content (I only see the content of the page http://www.made-up-domain-name-because-of-stack-overflow-restrictions.com/listUsers.do as it was when first loaded).
With Chrome I have the same problem as with Firefox.
Using IE I see only the original source.
Of course I could work around this by adding debugging mechanisms to the JS code and working half-blind, or by moving parts of the JS code to external files, etc., but if at all possible I would prefer to just view the code loaded via AJAX. Any suggestions, perhaps using some add-on?
Update: there is a better way; see the accepted answer to this question: How to debug dynamically loaded JavaScript (with jQuery) in the browser's debugger itself?
You can use the JavaScript Deobfuscator extension for that. It can show you what scripts are compiled/executed on a webpage - including the ones that were loaded dynamically.

How to parse malformed HTML with Ruby and Mechanize

I am using Mechanize to navigate a site that has badly malformed HTML. In particular, one page has checkboxes outside of any form, yet the server handles the resulting requests sanely.
I would like to check these boxes and click a "Submit" button, which is also outside the form. However, I can't use Form.checkbox_with because I don't have a Form object, only the Page. I can locate the checkbox on the page with
page.search("//input[@name='silly-checkbox']")
but I can't check it afterwards because Nokogiri is only used for scraping and does not track state. Is that incorrect?
How can I get a Mechanize::Form::Checkbox object when my checkbox is not in a form?
You could manually load the remote page using Nokogiri, fix the markup by finding the checkboxes outside the form and wrapping them in one, and then construct the Mechanize classes yourself from the fixed HTML, as sketched below.
You can also modify an existing form by deleting fields and merging in new ones:
form.add_field!('gender', 'male')
See the Mechanize::Form rdoc for details.
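A sketch of the wrap-and-rebuild approach described above, assuming a placeholder URL, a placeholder form action, and the checkbox name from the question:

require 'mechanize'
require 'nokogiri'

agent = Mechanize.new
page  = agent.get('https://example.com/broken') # placeholder URL

# Wrap the stray checkboxes in a synthetic <form> so Mechanize can treat
# them normally. The action and method are whatever the server expects.
html = Nokogiri::HTML(page.body)
form_node = Nokogiri::XML::Node.new('form', html)
form_node['action'] = '/submit'
form_node['method'] = 'get'
html.search("//input[@name='silly-checkbox']").each do |checkbox|
  form_node.add_child(checkbox) # add_child moves the node into the form
end
html.at('body').add_child(form_node)

# Rebuild a Mechanize page from the repaired markup and use the form as usual.
fixed = Mechanize::Page.new(page.uri, page.response, html.to_html, page.code, agent)
form  = fixed.forms.first
form.checkbox_with(name: 'silly-checkbox').check
agent.submit(form)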
