Getting Dynamically Generated HTML With Nokogiri/Open URI - ruby

I'm trying to scrape a site by looking at its HTML in Chrome and grabbing the data using Nokogiri. The problem is that some of the tags are dynamically generated, and they don't appear with an open(url) request when using open-uri. Is there a way to "force" a site to dynamically generate its content for a tool like open-uri to read?

If reading it via open-uri doesn't produce the content you need, then chances are good that the client is generating the content with JavaScript.
This may be good news - by inspecting the AJAX requests that the page makes, you might find a JSON feed of the content you're looking for, which you can then request and parse directly. This would get you your data without having to dig through the HTML - handy!
If that doesn't work for some reason, though, you're going to need to open the page with some kind of browser, let it execute its client-side JavaScript, then dump the resulting DOM to HTML. Something like PhantomJS is an excellent choice for this kind of work.
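For the JSON-feed route, here's a minimal sketch, assuming you've found the XHR endpoint in your browser's Network tab (the URL and key names below are placeholders):
require 'open-uri'
require 'json'
# Placeholder endpoint - substitute the XHR URL you see in the Network tab.
url = 'http://example.com/api/items?page=1'
# Some servers expect a browser-ish User-Agent or an XHR header.
json = open(url, 'User-Agent' => 'Mozilla/5.0', 'X-Requested-With' => 'XMLHttpRequest').read
items = JSON.parse(json)
items.each { |item| puts item['title'] } # 'title' stands in for whatever keys the feed really uses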

Related

Is there a simple way to have Mechanize get all the components of a webpage?

It's my understanding that when I do:
agent = Mechanize.new
page = agent.get("http://www.stackoverflow.com/")
Mechanize will make an HTTP GET request for the text/html. However, when I navigate to a webpage such as stackoverflow.com in a full web browser (like Chrome or Firefox), the browser reads in the HTML page and makes subsequent GET requests for the associated CSS, images, JavaScript, etc.
I can imagine parsing the initial HTML returned by Mechanize and identifying any CSS, images, etc., and making subsequent requests, but is there an easier way of having Mechanize automatically grab all, or a specified group, perhaps just the images of the associated components of a web page?
I would take a look at the Mechanize::PluggableParsers that are available. One of them probably does what you want.
No, Mechanize won't do that. Besides, what would be the point of Mechanize retrieving non-text content it can't parse?
Instead, identify the parts you want, and use Net::HTTP, Curb, Open-URI, Typhoeus, or any of the other HTTP-based tools to retrieve the content and save it to disk.
Actually, unless I needed Mechanize to navigate through some forms first, or maintain sessions, I'd write a small Ruby script that uses Nokogiri to pull out the needed elements. If you have to use Mechanize for the initial navigation, it'll load Nokogiri automatically to handle its DOM parsing, so piggy-back on the Mechanize page it gives you, which is a Nokogiri::HTML document. Search through the related questions for more information.
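As a rough illustration of that approach (a sketch, not a drop-in solution), you could let Mechanize fetch the page, use its underlying Nokogiri document to collect the image URLs, then save each one to disk with open-uri:
require 'mechanize'
require 'open-uri'
agent = Mechanize.new
page = agent.get('http://www.stackoverflow.com/')
# The Mechanize page is backed by Nokogiri, so we can search it directly.
image_urls = page.search('img').map { |img| img['src'] }.compact.uniq
image_urls.each do |src|
  begin
    uri = URI.join(page.uri, src)                # resolve relative paths
    File.open(File.basename(uri.path), 'wb') do |f|
      f.write open(uri).read                     # open-uri does the actual fetch
    end
  rescue OpenURI::HTTPError, URI::InvalidURIError
    next                                         # skip anything broken or missing
  end
end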

Heroku Friendly Ruby Scraper (with AJAX)?

So I'm trying to write a small unofficial API in Ruby to pull data from a site. I like Mechanize, but all the data I need from the page is generated by AJAX so Mechanize doesn't see it at all. What can I use to render a page with JavaScript so that I can scrape the data? I think something like spynner but for Ruby would do the trick.
I would also like to play with Heroku, so I'm looking for something that could be deployed there, which leads me away from something like Watir.
Does anything like this exist?
Update
For clarity, I'm trying to pull workout data from a Fitocracy profile page.
You may need an account before you can view the page, but basically all the workout data is displayed via JavaScript inside a page shell.
An AJAX request is the same as a non-AJAX request; it's just not always obvious how to construct it. Mechanize can make any request that a browser can make. Sure, Watir is easier, but if this is for an API you should do it the right way and use Mechanize.
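In practice that means reproducing the XHR yourself. Here's a hedged sketch with Mechanize; the endpoint URL is a placeholder for whatever shows up in the browser's Network tab, and the commented-out login form fields are assumptions:
require 'mechanize'
require 'json'
agent = Mechanize.new
# If the profile requires an account, sign in first; the URL and field names
# here are guesses, so inspect the real login form before relying on them.
# login_page = agent.get('https://www.fitocracy.com/accounts/login/')
# form = login_page.forms.first
# form['username'] = 'me'
# form['password'] = 'secret'
# agent.submit(form)
# Placeholder endpoint - use the XHR URL the profile page actually calls.
headers = { 'X-Requested-With' => 'XMLHttpRequest' }
response = agent.get('https://www.fitocracy.com/some/xhr/endpoint', [], nil, headers)
workouts = JSON.parse(response.body)
puts workouts.inspect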

How to load a page in background?

I have a project where we're using an iframe. However, project specs have changed and we can no longer use the iframe. Instead we need to request the html page in the background and display it on page when loaded.
Any ideas on how to do this via Ruby (Rails)? Thought it best to ask for general direction before diving in.
Thanks!
load it with ajax, and do a body append
It depends on where you want to have the work occur, on the back-end in your Rails code, or in the user's browser via JavaScript, as #stunaz suggests.
Keeping it in the browser and loading via JavaScript will expose the HTML page's location to the user, which might not be desirable. Loading it from the back-end and including it in the HTML emitted to the browser will hide the source entirely.
If you want to do it on the back end, the simplest approach is to load the file with File.read if it lives on the local drive; if it's on a different machine, you can use OpenURI to pull it in. Either way, you'd then insert it into the HTML in the right spot. How you do that depends on what you are using to generate the outgoing HTML.
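Here's a minimal Rails-flavoured sketch of the back-end approach; the controller name, file path, and URL are assumptions to adapt to your app:
# app/controllers/pages_controller.rb (names are assumptions)
require 'open-uri'
class PagesController < ApplicationController
  def show
    local_path = Rails.root.join('public', 'fragments', 'embedded.html')
    @embedded_html =
      if File.exist?(local_path)
        File.read(local_path)                          # local file: File.read
      else
        open('http://example.com/fragment.html').read  # remote: open-uri
      end
  end
end
In the view you'd then drop it into the page with something like <%= raw @embedded_html %>, sanitizing it first if the source isn't fully trusted.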

Google Search optimisation for ajax calls

I have a page on my site which has a list of things which gets updated frequently. This list is created by calling the server via jsonp, getting json back and transforming it into html. Fast and slick.
Unfortunately, Google isn't able to index it. After reading up on how to get this done according to Google's AJAX crawling guide, I am a bit confused and need some clarification and confirmation:
Only the AJAX pages need to implement the rules, right?
I currently have a rest url like
[site]/base/junkets/browse.aspx?page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
this would need to become something like:
[site]/base/junkets/browse.aspx#page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
And when google calls it like this
[site]/base/junkets/browse.aspx#!page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
I would have to deliver the html snapshot.
Why replace the ? with # ?
Creating html snapshots seems very cumbersome. Would it suffice to just serve simple links? In my case I would be happy if google would only index the things pages.
It looks like you've misunderstood the AJAX crawling guide. The #! notation is to be used on links to the page your AJAX application lives within, not on the URL of the service your application makes calls to. For example, if I access your app by going to example.com/app/, then you'd make the page crawlable by instead linking to example.com/app/#!page=1.
Now when Googlebot sees that URL in a link, instead of going to example.com/app/#!page=1 – which means issuing a request for example.com/app/ (recall that the hash is never sent to the server) – it will request example.com/app/?_escaped_fragment_=page=1. If _escaped_fragment_ is present in a request, you know to return the static HTML version of your content.
Why is all of this necessary? Googlebot does not execute script (nor does it know how to index your JSON objects), so it has no way of knowing what ends up in front of your users after your scripts run and content is loaded. So, your server has to do the heavy lifting of producing an HTML version of what your users ultimately see in the AJAXy version.
So what are your next steps?
First, either change the links pointing to your application to include #!page=1 (or whatever), or add <meta name="fragment" content="!"> to your app's HTML. (See item 3 of the AJAX crawling guide.)
When the user changes pages (if this is applicable), you should also update the hash to reflect the current page. You could simply set location.hash='#!page=n';, but I'd recommend using the excellent jQuery BBQ plugin to help you manage the page's hash. (This way, you can listen to changes to the hash if the user manually changes it in the address bar.) Caveat: the currently released version of BBQ (1.2.1) does not support AJAX crawlable URLs, but the most recent version in the Git master (1.3pre) does, so you'll need to grab it here. Then, just set the AJAX crawlable option:
$.param.fragment.ajaxCrawlable(true);
Second, you'll have to add some server-side logic to example.com/app/ to detect the presence of _escaped_fragment_ in the query string, and return a static HTML version of the page if it's there. This is where Google's guidance on creating HTML snapshots might be helpful. It sounds like you might want to pursue option 3. You could also modify your service to output HTML in addition to JSON.
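Since the rest of this thread is Ruby-centric (your own stack is ASP.NET, but the idea is the same anywhere), here's a hedged Rails-style sketch of that server-side check; the controller, model, and pagination calls are assumptions:
class BrowseController < ApplicationController
  def index
    if params['_escaped_fragment_'].present?
      # Googlebot asked for ?_escaped_fragment_=page=1 - serve the HTML snapshot.
      state = Rack::Utils.parse_query(params['_escaped_fragment_'])
      @junkets = Junket.order(state['sidx'] || 'ScoreAll').page(state['page'] || 1) # model, column and .page (Kaminari) are assumptions
      render 'snapshot'
    else
      # Normal visitors get the shell page; JavaScript fills it in from the JSON feed.
      render 'index'
    end
  end
end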
I've more or less given up on this. There really seems to be no alternative to generating the HTML on the server and delivering it in the HTML body if you want Google to index your directory.
I even tried adding a section wrapped in a .NET user control which implemented a simple HTML version of the directory, but Google managed to ignore that as well.
So in the end my directory has been de-ajaxified. :(

Listing ajax data in search engines?

Is there any way to allow search engines to list JSON or XML ajax data ?
I don't think there is a way to directly allow crawlers to index XML and JSON.
I would recommend trying to design your site using progressive enhancement. First, make all of the JSON and XML available in HTML form for users who don't use JavaScript. These users include some people with disabilities and the crawlers used by search engines. That will ensure your content is searchable.
Once you have that working and tested, add your AJAX functionality. You might do this by serving HTML, XML and JSON from a single URL using content negotiation, or you might have separate URLs.
Another graceful solution is to implement your AJAX calls as requests to full HTML pages and have your JavaScript use only the bit that it's interested in, e.g. a div with id "content". The suitability of this solution would depend on your exact requirements.
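For the content-negotiation variant, a minimal Rails sketch (controller and model names are assumptions) could look like this:
class ItemsController < ApplicationController
  def index
    @items = Item.all
    respond_to do |format|
      format.html                         # crawlers and no-JS users get index.html.erb
      format.json { render json: @items } # your AJAX code asks for the JSON version
    end
  end
end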
Hmm, no, not really. Search engines crawl your HTML and they don't really bother clicking around or even just loading your page into a browser and having the AJAX magic happen. Flash and JSON objects are by themselves invisible to search engines, and to get them visible, you have to transform them into some HTML.
The newest technique for getting AJAX requests to be listed in search engines is to ensure they have their own URL. This technique stems from the same one utilized by flash applications where each page has a unique identifier, preceded by a pound (#) sign.
There are currently a few jQuery plugins which will allow you to manage this:
SWFAddress - Deep Linking for Flash & AJAX
jQuery History Plugin
