Opening this page in Nokogiri doesn't yield the same elements as the rendered page in the browser:
https://www.barchart.com/futures/quotes/ESH20/options
How can I access the same source code as seen in the browser DevTools from the scraper library? This element in particular: <div class="bc-datatable"...
Is a headless browser required to get the correct page source first?
That data is coming from a JSON endpoint, so luckily no headless browser is required this time.
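For example, a minimal sketch with net/http (the endpoint URL below is a placeholder; copy the real one, and any required headers, from the request shown in the browser's DevTools Network tab):

require 'net/http'
require 'json'

# Placeholder URL -- substitute the JSON endpoint seen in DevTools.
uri = URI('https://www.barchart.com/some/json/endpoint')
response = Net::HTTP.get_response(uri)
data = JSON.parse(response.body) # a Ruby Hash/Array instead of HTML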
Is it possible to click on an element without loading the page, using Capybara and Ruby?
No, that is not possible. The browser needs to load the page to know what to do when the element is clicked.
If the element being clicked just triggers a web request, and you know what that request is, then you could make the request directly using any of the Ruby networking libraries, though you may then run into issues with the referrer, cookies, etc.
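A minimal sketch of replaying such a request with Net::HTTP (the URL and header values here are placeholders):

require 'net/http'

# Replay the request the click would have triggered; copy the real URL
# and headers from the browser's DevTools Network tab.
uri = URI('https://example.com/ajax/endpoint')
request = Net::HTTP::Get.new(uri)
request['Referer'] = 'https://example.com/page'  # some servers check this
request['Cookie']  = 'session=...'               # and/or this
response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(request)
end
puts response.body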
I am using Ruby gems (Nokogiri and Mechanize) to build a crawler for a website, but the website contains Bootstrap modals (popup windows) that are generated dynamically on a button click.
The modal's content shows up on a button click that issues a GET request to some URL.
I am getting a response by crawling the URL associated with the button, but it is just the same page source.
How could I get that dynamic content using Ruby?
The modal you're describing is, in all likelihood, rendered with JavaScript, so what you're looking for is not possible: the libraries you mention do not execute JavaScript.
To parse pages whose content is JavaScript-dependent, you should use other tools, e.g. Puppeteer.
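Puppeteer is a Node.js tool; a comparable approach in Ruby is to drive a headless browser with Watir. A minimal sketch (Watir 6 syntax; the URL and selectors are placeholders for whatever the real page uses):

require 'watir'

browser = Watir::Browser.new(:chrome, headless: true)
browser.goto 'https://example.com/page-with-modal'
browser.button(class: 'open-modal').click           # triggers the GET request
browser.div(class: 'modal').wait_until(&:present?)  # wait for the modal to render
puts browser.div(class: 'modal').html               # the dynamic content
browser.close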
Is there a way to get the HTML of the whole page with the rendered iframes' HTML included as well? I know you can access the browser HTML like so:
browser.html
but that only returns the current DOM's HTML, without the iframe contents. I also know the iframe content can be retrieved like so:
browser.iframe(index: 0).html
But is there a way to get the browser page contents with all the rendered iframe content included too? The reason I'm asking is that some iframes might have other iframes embedded, and so on, so it becomes cumbersome to parse.
Thanks.
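As far as I know there is no single Watir call for this, but a small recursive helper can collect everything, assuming nested frames expose the same iframes collection as the browser does. A sketch (the helper name is mine, not part of Watir's API):

require 'watir'

# Collect the HTML of the given context plus every nested iframe's HTML.
def collect_html(context)
  htmls = [context.html]
  context.iframes.each { |frame| htmls.concat(collect_html(frame)) }
  htmls
end

browser = Watir::Browser.new(:chrome)
browser.goto 'https://example.com'
puts collect_html(browser).join("\n")

Note that this concatenates the documents rather than splicing each iframe's HTML into its parent, but it does gather all the content in one pass.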
I'm scraping some HTML pages with Rails, using Nokogiri.
I ran into problems when I tried to scrape an AngularJS page, because the gem opens the HTML before it has been fully rendered.
Is there a way to scrape this type of page? How can I have the page fully rendered before scraping it?
If you're trying to scrape AngularJS pages in a fully generic fashion, then you're likely going to need something like what @tadman mentioned in the comments (PhantomJS): some type of headless browser that fully processes the AngularJS JavaScript and opens the DOM up to inspection afterwards.

If you have a specific site or sites to scrape, the path of least resistance is likely to avoid the AngularJS frontend entirely and directly query the API from which the Angular code is pulling content. The standard scenario for most AngularJS sites is that they pull down static JS and HTML code/templates, and then make AJAX calls back to a server (either their own, or some third-party API) to get the content that will be rendered. If you take a look at their code, you can likely directly query whatever Angular is calling (e.g. via $http, ngResource, or Restangular). The returned data is typically JSON and much easier to gather than scraping the post-rendered HTML.
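For instance, a minimal sketch with open-uri (the endpoint is a hypothetical stand-in for whatever the Angular code actually calls):

require 'open-uri'
require 'json'

# Hypothetical API endpoint -- find the real one in the Network tab,
# or in the app's $http/ngResource/Restangular calls.
data = JSON.parse(URI.open('https://example.com/api/users').read)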
You can use:
require 'phantomjs'
require 'watir'
require 'nokogiri'

b = Watir::Browser.new(:phantomjs) # drive a headless PhantomJS instance
b.goto URL                         # URL is the page you want to scrape
doc = Nokogiri::HTML(b.html)       # parse the fully rendered HTML

Download PhantomJS from http://phantomjs.org/download.html and move the binary to /usr/bin.
I have a JSP page where some parts of the page are loaded from the backend using AJAX. For example, when I first open the page, the URL is http://www.made-up-domain-name-because-of-stack-overflow-restrictions.com/listUsers.do. The page contains an "add user" button, which loads HTML content (containing a form etc.) from the backend into the div element with id "addArea". The URL stays the same the whole time (naturally), as the request is done in the background.
The problem I have is that the content loaded via AJAX is not fully viewable by any means.
Using Firefox, I can see the new HTML with the Firebug add-on and "Inspect element", but the content within the script tags is not visible that way (nor in Firebug's "Script" tab; only the originally loaded scripts appear there). If I use "View page source" in Firefox, a page reload is executed and I don't see the newly generated content (I only see the content of http://www.made-up-domain-name-because-of-stack-overflow-restrictions.com/listUsers.do as it was when first loaded).
With Chrome I have the same problem as with Firefox.
Using IE I see only the original source.
Of course, I can work around this by adding debugging mechanisms to the JS code and working half-blind, or by moving parts of the JS code to external files, etc., but if at all possible I would prefer to simply view the code loaded via AJAX. Any suggestions, perhaps using some add-on?
Update: there is a better way; see the accepted answer to this question: How to debug dynamically loaded JavaScript (with jQuery) in the browser's debugger itself?
You can use the JavaScript Deobfuscator extension for that. It shows you which scripts are compiled and executed on a webpage, including the ones loaded dynamically.