So I'm trying to write a small unofficial API in Ruby to pull data from a site. I like Mechanize, but all the data I need from the page is generated by AJAX so Mechanize doesn't see it at all. What can I use to render a page with JavaScript so that I can scrape the data? I think something like spynner but for Ruby would do the trick.
I would also like to play with Heroku, so I'm looking for something that could be deployed there, which leads me away from something like Watir.
Does anything like this exist?
Update
For clarity, I'm trying to pull workout data from a Fitocracy profile page.
You may need an account before you can view the page, but basically all the workout data is displayed via JavaScript inside a page shell.
An AJAX request is the same as a non-AJAX request; it's just not always obvious how to make it. Mechanize can make any request that a browser can make. Sure, Watir is easier, but if this is for an API you should do it the right way and use Mechanize.
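For example, once you spot the XHR in the browser's network tab, you can hit that endpoint directly with Mechanize. A minimal sketch; the endpoint, parameters, and JSON shape below are assumptions, not the real Fitocracy API:

require 'mechanize'
require 'json'

agent = Mechanize.new
# Hypothetical AJAX endpoint copied from the browser's network tab
response = agent.get('https://www.fitocracy.com/some/ajax/endpoint',
                     { 'user_id' => 123 },
                     nil,
                     { 'X-Requested-With' => 'XMLHttpRequest' })
workouts = JSON.parse(response.body)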
Related
I'm trying to scrape a site by looking at its HTML in Chrome and grabbing the data using Nokogiri. The problem is that some of the tags are dynamically generated, and they don't appear with an open(url) request when using open-uri. Is there a way to "force" a site to dynamically generate its content for a tool like open-uri to read?
If reading it via open-uri doesn't produce the content you need, then chances are good that the client is generating content with Javascript.
This may be good news - by inspecting the AJAX requests that the page makes, you might find a JSON feed of the content you're looking for, which you can then request and parse directly. This would get you your data without having to dig through the HTML - handy!
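For instance, if the page turns out to fetch its data from a JSON URL, you can request that URL directly. A sketch only; the URL and field names are made up:

require 'open-uri'
require 'json'

url = 'http://example.com/api/items.json'   # found by watching the page's XHR traffic
json = URI.open(url).read                   # plain open(url) on Rubies before 2.5
items = JSON.parse(json)
items.each { |item| puts item['name'] }     # 'name' is an assumed field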
If that doesn't work for some reason, though, you're going to need to open the page with some kind of browser, let it execute its clientside Javascript, then dump the resulting DOM to HTML. Something like PhantomJS is an excellent choice for this kind of work.
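One way to drive PhantomJS from Ruby is Capybara with the Poltergeist driver; this is just one possible setup, not something the answer above prescribes:

require 'capybara'
require 'capybara/poltergeist'
require 'nokogiri'

Capybara.register_driver :poltergeist do |app|
  Capybara::Poltergeist::Driver.new(app, js_errors: false)
end

session = Capybara::Session.new(:poltergeist)
session.visit('http://example.com/ajax-page')   # placeholder URL
doc = Nokogiri::HTML(session.html)              # the DOM after client-side JS has run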
It's my understanding that when I do:
agent = Mechanize.new
page = agent.get("http://www.stackoverflow.com/")
Mechanize will make an HTTP GET request for the text/html. However when I navigate to a webpage such as Stackoverflow.com in a full web browser (like Chrome/Firefox) the browser reads in the HTML page and makes subsequent GET requests for the associated CSS, images, JavaScript, etc.
I can imagine parsing the initial HTML returned by Mechanize and identifying any CSS, images, etc., and making subsequent requests, but is there an easier way of having Mechanize automatically grab all, or a specified group, perhaps just the images of the associated components of a web page?
I would take a look at the pluggable parsers Mechanize provides (Mechanize::PluggableParser). One of them probably does what you want.
No, Mechanize won't do that. Besides, what would be the point of Mechanize retrieving non-text content it can't parse?
Instead, identify the parts you want, and use Net::HTTP, Curb, Open-URI, Typhoeus, or any of the other HTTP-based tools to retrieve the content and save it to disk.
Actually, unless I needed Mechanize to navigate through some forms first or maintain sessions, I'd write a small Ruby script that uses Nokogiri to pull out the needed elements. If you have to use Mechanize for the initial navigation, it loads Nokogiri automatically to handle its DOM parsing, so piggy-back on the Mechanize page it gives you: its parser is a Nokogiri::HTML document.
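As a sketch of that approach for the images case (open-uri is just one of the HTTP tools mentioned above, and the save-to-disk details are an assumption):

require 'mechanize'
require 'open-uri'

agent = Mechanize.new
page  = agent.get('http://www.stackoverflow.com/')

# page.parser is the Nokogiri::HTML document behind the Mechanize page
page.parser.css('img').each do |img|
  next unless img['src']
  src  = URI.join(page.uri, img['src']).to_s   # resolve relative URLs
  name = File.basename(URI(src).path)
  next if name.empty?
  File.open(name, 'wb') { |f| f.write(URI.open(src).read) }
end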
I am using Mechanize to scrape some web pages.
I need to know what Mechanize's limitations are. What can Mechanize not do?
Can it execute JavaScript embedded in the web page?
Can I use it to call JavaScript functions? I don't think it can. I think Watir can.
What are the differences between it and Watir?
Mechanize can do a lot. It uses net/http, so whatever you can do with net/http you can do with Mechanize, though it supports much more, as per its description:
The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.
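To make that concrete, here is a minimal sketch of those basics (cookies, links, and forms); the URL and field names are placeholders:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/login')   # placeholder URL

form = page.forms.first                         # assumes a single login form
form['username'] = 'me'                         # field names are guesses
form['password'] = 'secret'
dashboard = agent.submit(form)

dashboard.links.each { |link| puts link.text }
# Cookies and history are kept on `agent` between requests.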
Check out this link for some information on using javascript with mechanize: here
It would be much easier to tell you whether Mechanize supports a specific function/task than to go through everything. What are you looking to do, exactly?
JavaScript is the one thing Mechanize can't do. What it usually does support is showing you the JavaScript links: page.links.each { |link| puts link.text } will list them along with everything else, but you won't be able to click/select them.
In simple terms, Watir does support JavaScript. It's actually your browser that supports JavaScript, and Watir controls the browser.
Watir runs a real browser (Firefox, Chrome, IE) and programmatically controls it. It acts exactly like a user would when accessing a website. This is what enables you to use JavaScript. Watir only controls the browser; the browser is the one sending requests, getting responses, and rendering/processing it all. You are limited by the speed of the browser you use.
Mechanize, on the other hand, acts as its own 'browser' and is much faster than Watir because it does not render pages. It talks directly with the server and processes the raw response. Mechanize is limited by your connection speed.
Watir would be used over Mechanize when you need to watch and see what's happening, use JavaScript, or do anything GUI-related. Mechanize is much faster and is good for testing the actual structure of the website (testing links, logins, etc.).
I have a page on my site which has a list of things which gets updated frequently. This list is created by calling the server via jsonp, getting json back and transforming it into html. Fast and slick.
Unfortunately, Google isn't able to index it. After reading up on how to get this done according to Google's AJAX crawling guide, I am a bit confused and need some clarification and confirmation:
Only the AJAX pages need to implement the rules, right?
I currently have a REST URL like:
[site]/base/junkets/browse.aspx?page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
this would need to become something like:
[site]/base/junkets/browse.aspx#page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
And when Google calls it like this:
[site]/base/junkets/browse.aspx#!page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
I would have to deliver the html snapshot.
Why replace the ? with a #?
Creating HTML snapshots seems very cumbersome. Would it suffice to just serve simple links? In my case I would be happy if Google would only index the pages for the things.
It looks like you've misunderstood the AJAX crawling guide. The #! notation is to be used on links to the page your AJAX application lives within, not on the URL of the service your application makes calls to. For example, if I access your app by going to example.com/app/, then you'd make the page crawlable by instead linking to example.com/app/#!page=1.
Now when Googlebot sees that URL in a link, instead of going to example.com/app/#!page=1 – which means issuing a request for example.com/app/ (recall that the hash is never sent to the server) – it will request example.com/app/?_escaped_fragment_=page=1. If _escaped_fragment_ is present in a request, you know to return the static HTML version of your content.
Why is all of this necessary? Googlebot does not execute script (nor does it know how to index your JSON objects), so it has no way of knowing what ends up in front of your users after your scripts run and content is loaded. So, your server has to do the heavy lifting of producing an HTML version of what your users ultimately see in the AJAXy version.
So what are your next steps?
First, either change the links pointing to your application to include #!page=1 (or whatever), or add <meta name="fragment" content="!"> to your app's HTML. (See item 3 of the AJAX crawling guide.)
When the user changes pages (if this is applicable), you should also update the hash to reflect the current page. You could simply set location.hash='#!page=n';, but I'd recommend using the excellent jQuery BBQ plugin to help you manage the page's hash. (This way, you can listen to changes to the hash if the user manually changes it in the address bar.) Caveat: the currently released version of BBQ (1.2.1) does not support AJAX crawlable URLs, but the most recent version in the Git master (1.3pre) does, so you'll need to grab it here. Then, just set the AJAX crawlable option:
$.param.fragment.ajaxCrawlable(true);
Second, you'll have to add some server-side logic to example.com/app/ to detect the presence of _escaped_fragment_ in the query string, and return a static HTML version of the page if it's there. This is where Google's guidance on creating HTML snapshots might be helpful. It sounds like you might want to pursue option 3. You could also modify your service to output HTML in addition to JSON.
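The page in question is ASP.NET, but the server-side check itself is simple. Here is the idea as a Ruby/Sinatra sketch (Ruby being the language used elsewhere on this page); the route and template names are assumptions:

require 'sinatra'

get '/app/' do
  if params['_escaped_fragment_']
    # Googlebot requested /app/?_escaped_fragment_=page=1
    page = params['_escaped_fragment_'][/page=(\d+)/, 1] || '1'
    erb :snapshot, locals: { page: page }   # pre-rendered HTML for the crawler
  else
    erb :shell                              # normal JS-driven page for users
  end
end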
I've more or less given up on this. There really seems to be no alternative to generating the HTML on the server and delivering it in the HTML body if you want Google to index your directory.
I even tried adding a section that wrapped a .NET user control implementing a simple HTML version of the directory, but Google managed to ignore that as well.
So in the end my directory has been de-ajaxified. :(
I am using Ruby to screen-scrape a web page (created in ASP.NET) which uses a GridView to display data. I can successfully read the data displayed on page 1 of the grid, but I can't figure out how to move to the next page of the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL) but JavaScript hyperlinks that cause a postback to the same page.
An example of the hyperlink (only its link text survives here):
6
I recommend using Watir, a ruby library designed for browser testing, if you're already using ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |next_page|
  ie.link(:text, next_page.to_s).click
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to run frequently and quietly in the background in a completely automated way, this may not be the best approach. On the other hand, if it's OK to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP you've got the tools to find the definition of __doPostBack already (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags).
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can do a postback by replacing your GET with a POST, e.g. my_http.post(my_url, form_data) instead of my_http.get(my_url), where form_data is the URL-encoded form body (including the ASP.NET hidden fields).
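For instance, an ASP.NET postback with Net::HTTP looks roughly like this. A sketch only; the URL, control names, and viewstate value are placeholders you'd copy from the real page:

require 'net/http'
require 'uri'

uri = URI('http://example.com/GridPage.aspx')   # placeholder URL

# The pager postback usually needs __EVENTTARGET/__EVENTARGUMENT plus the
# __VIEWSTATE (and often __EVENTVALIDATION) hidden-field values copied from
# the page you already fetched.
viewstate = 'PASTE_FROM_THE_HIDDEN___VIEWSTATE_FIELD'
params = {
  '__EVENTTARGET'   => 'GridView1',             # control name is a guess
  '__EVENTARGUMENT' => 'Page$2',
  '__VIEWSTATE'     => viewstate,
}

response = Net::HTTP.post_form(uri, params)
puts response.body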
Edit (in response to danieltalsky's answer):
watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or go through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.
You will have to perform the postback. The data is passed with a form POST back to the server. Like Markus said, use something like Firebug or the Developer Tools in IE 8, plus Fiddler, to watch the traffic. But honestly, this is a web form using the bloated GridView, and you will be in for a fun adventure. ;)
You'll need to do some investigation in order to figure out what HTTP request the javascript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear to you which requests you will need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding onto cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
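If you do go with Mechanize, one way to fake the pager's postback is to fill in the ASP.NET hidden fields and resubmit the form. This is only a sketch; the URL and control names are assumptions you'd read out of the page's __doPostBack call:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/GridPage.aspx')   # placeholder URL

form = page.forms.first                                  # the ASP.NET <form>
form['__EVENTTARGET']   = 'GridView1'                    # e.g. from __doPostBack('GridView1', 'Page$2')
form['__EVENTARGUMENT'] = 'Page$2'
next_page = agent.submit(form)

# next_page.parser is a Nokogiri document you can search for the grid rows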