Set "wait" in KNIME scraper - ajax

I'm building a news scraper for a project. I found my way through most of the sites, but one is giving me a headache: whenever I try to bulk-extract the articles' contents, most of the HTML behind the links won't load.
I even tried in Python, with the same incomplete results.
My question is:
how can I set a "wait until content is loaded" step? From what I'm reading, the content may be loaded via Ajax after the initial page load.

I think what you are looking for are the Selenium Nodes. They are specifically targeted at extracting data from Ajax-based websites, where content is loaded via JavaScript code.
You can find some example workflows here:
https://nodepit.com/server/seleniumnodes.com
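Outside of KNIME, the same "wait until the content is loaded" idea can be expressed directly in code with an explicit wait. A minimal sketch using the Ruby selenium-webdriver gem (the URL and CSS selector are placeholders, since the actual site isn't named):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :chrome
driver.get('http://example.com/news')  # placeholder URL

# Explicit wait: block for up to 15 seconds until the Ajax-loaded
# article elements actually appear in the DOM before scraping.
wait = Selenium::WebDriver::Wait.new(timeout: 15)
wait.until { driver.find_elements(css: 'article.story').any? }

html = driver.page_source  # now contains the JavaScript-rendered HTML
driver.quit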

Related

AJAX Crawling with question mark instead of hashbang

Where I'm at: I've read Google's documentation regarding its AJAX crawling, and I've searched around a bit on this website and others, but I'm quite confused, as it seems that all the problems address the same issue: AJAX crawling with hashbangs.
I've developed an app which, among other purposes, lets the user search for locations worldwide, using an AJAX searcher quite similar to Google's, but my app exclusively uses the question mark in its AJAX URLs instead of the hashbang. Due to compatibility issues, changing it to the hashbang is not an option.
Not only am I largely confused by the fact that I could not find anyone else using the question mark instead of the hashbang, I'm also wondering whether there is any documentation regarding my issue: how to let Googlebot crawl all my AJAX content when I'm using the question mark instead of a hashbang in my AJAX app.
The AJAX crawling scheme was created explicitly for applications and websites using the hashbang (#!) in their URL structure, because the fragment part of a URL only exists on the client side; the URL rewriting in the spec, from #! to ?_escaped_fragment_= (e.g. example.com/#!/about becomes example.com/?_escaped_fragment_=/about), is meant to solve exactly that.
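For reference, a minimal sketch of what the server side of that scheme can look like, assuming a Ruby/Sinatra backend (the route and snapshot paths are hypothetical):

require 'sinatra'

get '/' do
  if (fragment = params['_escaped_fragment_'])
    # The crawler rewrote example.com/#!/about to ?_escaped_fragment_=/about,
    # so serve a pre-rendered HTML snapshot of that client-side state.
    send_file File.join('snapshots', "#{fragment.delete('/')}.html")
  else
    erb :index  # the normal JavaScript-driven page
  end
end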
Since most of the web already makes use of JavaScript in one way or another, we (Google) needed a better solution, so we started executing JavaScript in the pages we crawled, effectively rendering every page just like a normal browser would. To quote our blog post, Understanding web pages better:
In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.
You can also see what we "see" by using Fetch as Google in Search Console (formerly Webmaster Tools); read more about the feature in our post titled Rendering pages with Fetch as Google.
Before you do anything else, please try to fetch a few pages from your site with Fetch as Google. You might not have to do anything at all; it might actually work out of the box. And the good news is that it's not only Google that's rendering pages!

One-page AJAX-based WordPress site. How should I do it?

I am trying to create a one-page WordPress website, something like the ones you sometimes see in ThemeForest's WP section: the whole website is a long page that has everything in one place, from about us, to portfolio, to some blog posts, to contacts.
Placing all things on one page is not difficult. But when I started thinking about how to present individual posts and pages, I realised that I probably need a general way of getting posts' data via AJAX and creating new blocks with JS. How should I go about this? I suppose this has been done before, but I struggle to find something this specific in the Codex or a tutorial with best practices.
Any advice or link will be greatly appreciated.
You could use a plugin such as jQuery Easytabs, which has a built-in Ajax component.
I've found that the easiest way is to just get all the content to load into the divs ahead of time, versus trying to load all pages through Ajax. However, appending something like '?ajax/ajax' to the end of your URLs through the Easytabs plugin is one option that I have successfully used in the past.
If you decide to use the easytabs functionality, there is ample documentation on the page that I linked to.

Purely JavaScript Solution for Google Ajax Crawlable Spec

I have a project which is heavily JavaScript based (e.g. node.js, backbone.js, etc.). I'm using hashbang URLs like /#!/about and have read the Google AJAX crawlable spec. I've done a wee bit of headless UI testing with zombie and can easily conceive of how this could be done by setting a slight delay and returning static content back to Googlebot. But I don't really want to implement this from scratch and was hoping there was a pre-existing library that fits in with my stack. Know of one?
EDIT: At time of writing I don't think this exists. However, rendering using backbone (or similar) on server and client is a plausible approach (even if not a direct answer). So I'm going to mark that as answer although there may be better solutions in the future.
Just to chime in, I ran into this issue too (I have a very Ajax/JS-heavy site), and I found this, which may be of interest:
crawlme
I have yet to try it, but it sounds like it will make the whole process a piece of cake if it works as advertised! It's a piece of Connect/Express middleware that is simply inserted before any calls to pages, and apparently takes care of the rest.
Edit:
Having tried crawlme, I had some success, but the backend headless browser it uses (zombie.js) was failing with some of my JavaScript content, likely because it works by emulating the DOM and thus won't be perfect.
Sooo, instead I got hold of a full WebKit-based headless browser, phantomjs, and a set of Node bindings for it, like this:
npm install phantomjs node-phantom
I then created my own script, similar to crawlme, but using phantomjs instead of zombie.js. This approach seems to work perfectly, and will render every single one of my Ajax-based pages correctly. The script I wrote to pull this off can be found here. To use it, simply:
var googlebot = require("./path-to-file");
and then, before any other routes in your app (this is using Express, but should work with just Connect too):
app.use(googlebot());
The source is relatively simple, minus a couple of regexps, so have a gander :)
Result: an AJAX-heavy node.js/Connect/Express-based website can be crawled by Googlebot.
There is one implementation using node.js and Backbone.js on the server and in the browser:
https://github.com/Morriz/backbone-everywhere
The crawlable Node.js module seems to fit this purpose: https://npmjs.org/package/crawlable, and there is an example of such a SPA that can be rendered server-side in Node: https://github.com/trupin/crawlable-todos
Backbone looks interesting:
http://documentcloud.github.com/backbone/
http://lostechies.com/derickbailey/2011/09/26/seo-and-accessibility-with-html5-pushstate-part-1-introducing-pushstate/

Google Suggest: how does it work?

How does Google Suggest work? How does it manage to update the web page on the client so quickly, based on information in a distant Google database? Why does the web page not look ‘jumpy’ if it is being frequently updated?
It uses AJAX.
While you are typing your query, it searches for the 10 most requested words matching what you have typed so far, then it writes the minified JSON into an invisible DIV element. Fast, but still resource-intensive.
Try installing Firebug on Firefox, or use the Developer Console in Chrome: open the console and start writing "Youtube" or whatever you want, and you will see the minified JSON responses.
Good luck :D
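To make that mechanism concrete, here is a minimal sketch of the server half of such a suggestion endpoint, assuming a Ruby/Sinatra backend (the route, parameter name and word list are made up; the real service is far more sophisticated):

require 'sinatra'
require 'json'

# Assumption: a popularity-ordered list of queries loaded at boot.
WORDS = File.readlines('popular_queries.txt', chomp: true)

get '/suggest' do
  content_type :json
  prefix = params['q'].to_s.downcase
  # Return the ten most requested entries matching the typed prefix;
  # client-side JavaScript injects this JSON into the page.
  WORDS.select { |w| w.downcase.start_with?(prefix) }.first(10).to_json
end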
In addition to the front-end handling others have talked about, which jQuery is a great example of, you might also be interested in how they approach the idea on the backend. Dr. Peter Norvig has written about how to create a spelling corrector, where similar approaches could be used to find close matches.
The whole page is not being updated. Only parts of it are using AJAX - Asynchronous Javascript and XML. Ajax requests can be made in Javascript, and the page updated when the response comes back.
A far more interesting question is how does Google actually search 10bn+ documents in a teeny tiny fraction of a second :)

What are some good Ruby-based web crawlers? [closed]

I am looking at writing my own, but I am wondering if there are any good web crawlers out there which are written in Ruby.
Short of a full-blown web crawler, any gems that might be helpful in building a web crawler would be useful. I know this part of the question is touched upon in a couple of places, but a list of gems applicable to building a web crawler would be a great resource as well.
I used to write spiders, page scrapers and site analyzers for my job, and still write them periodically to scratch some itch I get.
Ruby has some excellent gems to make it easy:
Nokogiri is my #1 choice for the HTML parser. I used to use Hpricot, but found some sites that made it explode in flames. I switched to Nokogiri afterwards and have been very happy with it. I regularly use it for parsing HTML, RDF/RSS/Atom and XML. Ox looks interesting too, so that might be another candidate, though I find searching the DOM a lot easier than trying to walk through a big hash, such as what is returned by Ox.
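A minimal Nokogiri sketch (the URL is a placeholder): fetch a page, parse it, and walk its links.

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://example.com/'))

# CSS selectors make searching the DOM straightforward.
doc.css('a').each do |a|
  puts "#{a.text.strip} -> #{a['href']}"
end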
OpenURI is good as a simple HTTP client, but it can get in the way when you want to do more complex things or need to have multiple requests firing at once. I'd recommend looking at HTTPClient, or Typhoeus with Hydra, for modest to heavyweight jobs. Curb is good too, because it uses the cURL library, but the interface isn't as intuitive to me. It's worth looking at though.
Note: OpenURI has some flaws and vulnerabilities that can affect unsuspecting programmers so it's fallen out of favor somewhat. RestClient is a very worthy successor.
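As a sketch of the concurrent case mentioned above, here is Typhoeus with Hydra firing several requests at once (the URLs are placeholders):

require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 10)

%w[http://example.com/a http://example.com/b].each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    puts "#{url}: HTTP #{response.code}, #{response.body.bytesize} bytes"
  end
  hydra.queue(request)
end

hydra.run  # blocks until every queued request has finished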
You'll need a backing database, and some way to talk to it. This isn't a task for Rails per se, but you could use ActiveRecord, detached from Rails, to talk to the database. I've done that a couple of times and it works all right. These days, though, I really like Sequel for my ORM. It's very flexible in how it lets you talk to the database, from using straight SQL, to using Sequel's ability to programmatically build a query, to modeling the database and using migrations. Once you have the database built, you could still use Rails to act as a front-end to the data.
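A minimal Sequel sketch illustrating that range, from dataset-based query building down to straight SQL (table and column names are illustrative):

require 'sequel'

DB = Sequel.sqlite  # in-memory SQLite database, just for the example

DB.create_table :pages do
  primary_key :id
  String :url
  String :title
end

pages = DB[:pages]
pages.insert(url: 'http://example.com/', title: 'Example')

# Programmatic query building...
pages.where(Sequel.like(:url, 'http://example.com%')).each { |row| p row }

# ...or straight SQL when you prefer it.
DB.fetch('SELECT title FROM pages WHERE id = ?', 1) { |row| p row }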
If you are going to navigate sites in any way beyond simply grabbing pages and following links, you'll want to look at Mechanize. It makes it easy to fill out forms and submit pages. As an added bonus, you can grab the content of a page as a Nokogiri HTML document and parse away using Nokogiri's multitude of tricks.
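A short Mechanize sketch (the site and form field are made up): submit a form, then query the result page both through Mechanize's helpers and through the underlying Nokogiri document.

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/search')

form = page.forms.first
form['q'] = 'ruby crawler'  # assumed field name
results = agent.submit(form)

# Mechanize's own helpers...
results.links.each { |link| puts link.href }

# ...or the parsed Nokogiri document for full CSS/XPath querying.
results.parser.css('a.result').each { |a| puts a['href'] }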
For massaging/mangling URLs I really like Addressable::URI. It's more full-featured than the built-in URI module. One thing that URI does that's nice is it has the URI#extract method to scan a string for URLs. If that string happened to be the body of a web page it would be an alternate way of locating links, but its downside is you'll also get links to images, videos, ads, etc., and you'll have to filter those out, probably resulting in more work than if you use a parser and look for <a> tags exclusively. For that matter, Mechanize also has the links method which returns all the links in a page, but you'll still have to filter them to determine whether you want to follow or ignore them.
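A sketch contrasting the two approaches just described (the page body is illustrative):

require 'uri'
require 'addressable/uri'

body = 'See http://example.com/a and http://example.com/img/logo.png'

# Stdlib URI.extract: quick, but returns every URL, images and all.
URI.extract(body, %w[http https]).each { |u| puts u }

# Addressable shines at normalizing and joining relative links.
base = Addressable::URI.parse('http://example.com/section/')
puts(base + 'page.html')  # => http://example.com/section/page.html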
If you think you'll need to deal with Javascript manipulated pages, or pages that get their content dynamically from AJAX, you should look into using one of the WATIR variants. There are flavors for the different browsers on different OSes, such as Firewatir, Safariwatir and Operawatir, so you'll have to figure out what works for you.
You do NOT want to rely on keeping your list of URLs to visit, or visited URLs, in memory. Design a database schema and store that information there. Spend some time up front designing the schema and thinking about what you'll want to know as you collect links on a site. SQLite3, MySQL and Postgres are all excellent choices, depending on how big you think your database needs will be. One of my site analyzers was custom-designed to help us recommend SEO changes for a Fortune 50 company. It ran for over three weeks, covering about twenty different sites, before we had enough data and stopped it. Imagine what would have happened if we'd had a power outage and all that data had gone into the bit-bucket.
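A sketch of what persisting that crawl state might look like with Sequel and SQLite3 (the schema is illustrative, not a recommendation for any particular site):

require 'sequel'

DB = Sequel.sqlite('crawl_state.db')

DB.create_table? :urls do
  primary_key :id
  String    :url, null: false, unique: true
  TrueClass :visited, default: false
  Integer   :http_status
  DateTime  :fetched_at
end

def enqueue(url)
  DB[:urls].insert(url: url)
rescue Sequel::UniqueConstraintViolation
  # Already queued or visited; the unique index does the deduplication.
end

def next_unvisited
  DB[:urls].where(visited: false).first
end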
After all that you'll want to also make your code be aware of proper spidering etiquette: What are the key considerations when creating a web crawler?
I am building wombat, a Ruby DSL to crawl web pages and extract content. Check it out on github https://github.com/felipecsl/wombat
It is still at an early stage but already provides the basic functionality. More stuff will be added really soon.
So you want a good Ruby-based web crawler?
Try spider or anemone. Both have solid usage according to RubyGems download counts.
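For instance, a minimal anemone sketch (the domain and depth limit are placeholders):

require 'anemone'

Anemone.crawl('http://example.com/', depth_limit: 2) do |anemone|
  anemone.on_every_page do |page|
    title = page.doc && page.doc.at('title')
    puts "#{page.url} #{title && title.text}"
  end
end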
The other answers, so far, are detailed and helpful but they don't have a laser-like focus on the question, which asks for ruby libraries for web crawlers. It would seem that this distinction can get muddled: see my answer to "Crawling vs. Web-Scraping?"
Tin Man's comprehensive list is good but partly outdated for me.
Most websites my customers deal with are heavily AJAX/Javascript dependent.
I've been using Watir / watir-webdriver / Selenium for a few years too, but the overhead of having to load up a hidden web browser on the backend to render that DOM stuff just isn't viable. On top of that, after all this time they still haven't implemented a usable "browser session reuse" to let a new code execution reuse an old browser in memory for this purpose, shooting down tickets that might have worked their way up the API layers eventually (referring to https://code.google.com/p/selenium/issues/detail?id=18). **
https://rubygems.org/gems/phantomjs
is what we're migrating new projects over to now, to let the necessary data get rendered without any sort of invisible, Xvfb-backed, memory- and CPU-heavy web browser.
** Alternative approaches also failed to pan out:
how to serialize an object using TCPServer inside?
Can a watir browser object be re-used in a later Ruby process?
If you don't want to write your own, then use any ordinary web crawler. There are dozens out there.
If you do want to write your own, then write your own. A web crawler isn't exactly a complicated activity; it consists of the following (see the sketch after this list):
Downloading a website.
Locating URLs in that website, filtered however you dang well please.
For each URL located, repeat from step 1.
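A bare-bones sketch of those three steps (the start URL and same-host filter are placeholders), with a visited set so the loop terminates:

require 'net/http'
require 'uri'
require 'nokogiri'
require 'set'

queue   = ['http://example.com/']
visited = Set.new

while (url = queue.shift)
  next if visited.include?(url)
  visited << url

  body = Net::HTTP.get(URI(url))  # step 1: download the page
  Nokogiri::HTML(body).css('a[href]').each do |a|
    link = URI.join(url, a['href']).to_s rescue next
    # Step 2: filter however you dang well please (here: same host only).
    next unless link.start_with?('http://example.com')
    queue << link unless visited.include?(link)  # step 3: repeat
  end
end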
Oh, and this seems to be a duplicate of "Web crawler in ruby".
