What are some good Ruby-based web crawlers? [closed] - ruby

I am looking at writing my own, but I am wondering if there are any good web crawlers out there which are written in Ruby.
Short of a full-blown web crawler, any gems that might be helpful in building a web crawler would be useful. I know this part of the question is touched upon in a couple of places, but a list of gems applicable to building a web crawler would be a great resource as well.

I used to write spiders, page scrapers and site analyzers for my job, and still write them periodically to scratch some itch I get.
Ruby has some excellent gems to make it easy:
Nokogiri is my #1 choice for the HTML parser. I used to use Hpricot, but found some sites that made it explode in flames. I switched to Nokogiri afterwards and have been very happy with it. I regularly use it for parsing HTML, RDF/RSS/Atom and XML. Ox looks interesting too, so that might be another candidate, though I find searching the DOM a lot easier than trying to walk through a big hash, such as what is returned by Ox.
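To give a flavor of what that looks like, here is a minimal, untested Nokogiri sketch (the URL and selectors are placeholders):

    require 'nokogiri'
    require 'open-uri'

    # Fetch and parse a page (example.com is just a placeholder).
    doc = Nokogiri::HTML(URI.open('https://example.com/'))

    puts doc.at_css('title')&.text        # grab the page title
    doc.css('a[href]').each do |anchor|   # search the DOM with CSS selectors
      puts anchor['href']
    end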
OpenURI is good as a simple HTTP client, but it can get in the way when you want to do more complex things or need multiple requests firing at once. For modest to heavyweight jobs, I'd recommend looking at HTTPClient or Typhoeus with Hydra. Curb is good too, because it uses the cURL library, but its interface isn't as intuitive to me; it's still worth looking at, though I lean toward the previously mentioned gems.
Note: OpenURI has some flaws and vulnerabilities that can affect unsuspecting programmers so it's fallen out of favor somewhat. RestClient is a very worthy successor.
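To illustrate the "multiple requests firing at once" point, here is a rough Typhoeus/Hydra sketch (the URLs and concurrency setting are arbitrary):

    require 'typhoeus'

    urls  = %w[https://example.com/a https://example.com/b https://example.com/c]
    hydra = Typhoeus::Hydra.new(max_concurrency: 10)

    urls.each do |url|
      request = Typhoeus::Request.new(url, followlocation: true)
      request.on_complete do |response|
        puts "#{url} -> #{response.code} (#{response.body.bytesize} bytes)"
      end
      hydra.queue(request)
    end

    hydra.run   # runs all queued requests concurrently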
You'll need a backing database, and some way to talk to it. This isn't a task for Rails per se, but you could use ActiveRecord, detached from Rails, to talk to the database. I've done that a couple times and it works all right. Instead, I really like Sequel for my ORM. It's very flexible in how it lets you talk to the database, from using straight SQL to using Sequel's ability to programmatically build a query, to modeling the database and using migrations. Once you have the database built, you could use Rails to act as a front-end to the data though.
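As a taste of that flexibility, a small untested Sequel sketch; the table and columns are made up for illustration:

    require 'sequel'

    DB = Sequel.sqlite('crawler.db')   # or Sequel.connect('postgres://...')

    DB.create_table? :pages do
      primary_key :id
      String  :url, unique: true, null: false
      Integer :status
    end

    DB[:pages].insert(url: 'https://example.com/', status: 200)         # dataset API
    puts DB[:pages].where(status: 200).count                            # programmatic query
    puts DB.fetch('SELECT url FROM pages WHERE status >= ?', 400).all   # straight SQL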
If you are going to navigate sites in any way beyond simply grabbing pages and following links, you'll want to look at Mechanize. It makes it easy to fill out forms and submit pages. As an added bonus, you can grab the content of a page as a Nokogiri HTML document and parse away using Nokogiri's multitude of tricks.
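Something along these lines (an untested sketch; the URL, form, and field names are hypothetical):

    require 'mechanize'

    agent = Mechanize.new
    page  = agent.get('https://example.com/login')   # hypothetical login page

    form = page.forms.first
    form['username'] = 'user'                        # hypothetical field names
    form['password'] = 'secret'
    result = agent.submit(form)

    doc = result.parser                              # a Nokogiri::HTML::Document
    doc.css('a[href]').each { |a| puts a['href'] }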
For massaging/mangling URLs I really like Addressable::URI. It's more full-featured than the built-in URI module. One thing that URI does that's nice is it has the URI#extract method to scan a string for URLs. If that string happened to be the body of a web page it would be an alternate way of locating links, but its downside is you'll also get links to images, videos, ads, etc., and you'll have to filter those out, probably resulting in more work than if you use a parser and look for <a> tags exclusively. For that matter, Mechanize also has the links method which returns all the links in a page, but you'll still have to filter them to determine whether you want to follow or ignore them.
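For example (a rough sketch; the sample string is made up):

    require 'addressable/uri'
    require 'uri'

    # Addressable normalizes case, default ports, dot segments, and non-ASCII characters.
    uri = Addressable::URI.parse('HTTPS://Example.COM:443/a/../b/?q=café')
    puts uri.normalize.to_s

    # URI.extract scans any string (e.g. a page body) for URLs -- images, ads and all.
    body = 'See https://example.com/page and https://example.com/logo.png for details.'
    puts URI.extract(body, %w[http https]).inspect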
If you think you'll need to deal with Javascript manipulated pages, or pages that get their content dynamically from AJAX, you should look into using one of the WATIR variants. There are flavors for the different browsers on different OSes, such as Firewatir, Safariwatir and Operawatir, so you'll have to figure out what works for you.
You do NOT want to rely on keeping your list of URLs to visit, or visited URLs, in memory. Design a database schema and store that information there. Spend some time up front designing the schema, thinking about what things you'll want to know as you collect links on a site. SQLite3, MySQL and Postgres are all excellent choices, depending on how big you think your database needs will be. One of my site analyzers was custom designed to help us recommend SEO changes for a Fortune 50 company. It ran for over three weeks covering about twenty different sites before we had enough data and stopped it. Imagine what would have happened if we'd had a power outage and all that data had gone into the bit bucket.
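One way to keep that state out of memory, sketched here with Sequel and SQLite purely as an example (the schema is illustrative, not prescriptive):

    require 'sequel'

    DB = Sequel.sqlite('crawl_state.db')

    DB.create_table? :links do
      primary_key :id
      String    :url, unique: true, null: false
      TrueClass :visited, default: false
      DateTime  :fetched_at
    end

    # Enqueue a URL if we have not seen it, then pull the next unvisited one.
    url = 'https://example.com/'
    DB[:links].insert(url: url) if DB[:links].where(url: url).empty?
    next_link = DB[:links].where(visited: false).first
    DB[:links].where(id: next_link[:id]).update(visited: true, fetched_at: Time.now) if next_link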
After all that you'll want to also make your code be aware of proper spidering etiquette: What are the key considerations when creating a web crawler?

I am building wombat, a Ruby DSL to crawl web pages and extract content. Check it out on GitHub: https://github.com/felipecsl/wombat
It is still at an early stage but already covers the basics, and more features will be added soon.

So you want a good Ruby-based web crawler?
Try spider or anemone. Both have solid usage according to RubyGems download counts.
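For reference, Anemone's block-based API looks roughly like this (untested sketch; the URL and options are placeholders):

    require 'anemone'

    Anemone.crawl('https://example.com/', depth_limit: 2) do |anemone|
      anemone.skip_links_like(/\.(png|jpg|pdf)$/i)   # skip obvious non-HTML links
      anemone.on_every_page do |page|
        puts "#{page.code} #{page.url}"
      end
    end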
The other answers, so far, are detailed and helpful, but they don't have a laser-like focus on the question, which asks for Ruby libraries for web crawlers. It would seem that this distinction can get muddled: see my answer to "Crawling vs. Web-Scraping?"

Tin Man's comprehensive list is good but partly outdated for me.
Most websites my customers deal with are heavily AJAX/Javascript dependent.
I've been using Watir / watir-webdriver / Selenium for a few years too, but the overhead of loading a hidden web browser on the backend just to render the DOM isn't viable. On top of that, they still haven't implemented a usable "browser session reuse" that would let a new run of code pick up an existing browser already in memory, and tickets that might eventually have pushed this up through the API layers have been shot down (referring to https://code.google.com/p/selenium/issues/detail?id=18). **
https://rubygems.org/gems/phantomjs is what we're migrating new projects to now, so the necessary data gets rendered without any kind of invisible, Xvfb-backed, memory- and CPU-heavy web browser.
** Alternative approaches also failed to pan out:
how to serialize an object using TCPServer inside?
Can a watir browser object be re-used in a later Ruby process?

If you don't want to write your own, then use any ordinary web crawler. There are dozens out there.
If you do want to write your own, then write your own. A web crawler isn't exactly complicated; it consists of (a minimal sketch follows this list):
Downloading a website.
Locating URLs in that website, filtered however you dang well please.
For each URL in that website, repeat step 1.
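Here is that loop as a minimal, untested Ruby sketch using Nokogiri and open-uri; it deliberately ignores robots.txt, rate limiting, and error handling:

    require 'nokogiri'
    require 'open-uri'
    require 'set'

    # Minimal breadth-first crawl: download, extract links, repeat.
    def crawl(start_url, limit: 50)
      queue   = [start_url]
      visited = Set.new
      until queue.empty? || visited.size >= limit
        url = queue.shift
        next unless visited.add?(url)
        doc = Nokogiri::HTML(URI.open(url))             # 1. download the page
        doc.css('a[href]').each do |a|                  # 2. locate URLs, filter as you please
          link = (URI.join(url, a['href']).to_s rescue next)
          queue << link if link.start_with?('http')     # 3. repeat for each URL found
        end
      end
      visited
    end

    puts crawl('https://example.com/').to_a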
Oh, and this seems to be a duplicate of "Web crawler in ruby".

Related

AJAX Crawling with question mark instead of hashbang

Where I'm at: I've read Google's documentation regarding its AJAX crawling, and I've searched around a bit on this website and others, but I'm quite confused, as it seems that all the problems address the same issue: AJAX crawling with hashbangs.
I've developed an app which, among other purposes, lets the user search for locations worldwide, using an AJAX searcher quite similar to Google's, but my app uses the question mark exclusively in its AJAX URLs instead of the hashbang. Due to compatibility issues, changing it to the hashbang is not an option.
Not only am I largely confused by the fact that I could not find anyone else using the question mark instead of the hashbang, I'm also wondering if there is any documentation regarding my issue: how to let Googlebot crawl all my AJAX content when I'm using the question mark instead of a hashbang in my AJAX app.
The AJAX crawling scheme was created explicitly for applications and websites using the hashbang (#!) in their URL structure, because the fragment part of a URL only exists on the client side; the URL rewriting in the spec, i.e. from #! to ?_escaped_fragment_=, is meant to solve that.
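To make the rewrite concrete, a tiny illustrative snippet (the URL is made up; in the full scheme the fragment value is also URL-encoded):

    pretty_url  = 'https://example.com/app#!/products/42'
    crawler_url = pretty_url.sub('#!', '?_escaped_fragment_=')
    puts crawler_url   # the crawler requests this URL, expecting a static HTML snapshot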
Since most of the web already makes use of JavaScript in one way or another, we (Google) needed a better solution, so we started executing JavaScript in the pages we crawl, effectively rendering every page just like a normal browser would. To quote our blog post, Understanding web pages better:
In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.
You can also see what we "see" using Fetch as Google in Search Console (formerly Webmaster Tools); read more about the feature in our post titled Rendering pages with Fetch as Google.
Before you do anything else, please try to fetch a few pages from your site with Fetch as Google. You might not have to do anything at all, it might actually work out of the box. And the good news is that it's not only Google that's rendering pages!

Beginner looking for suggestion: building my first website with GAE, picking framework and looking for AJAX learning resources [closed]

I have zero web programming experience but have been in the IT industry for a while, mainly as a CRM technical consultant. I'm familiar with VBScript and JavaScript, not in a web context but as general scripting tools. I'm good at designing business processes and database models and at using DB queries. I have a basic understanding of GAE and Python from doing Google's tutorials. I used to write some tools with C# and VB6 a long time ago.
So I've decided to build my first website on Google AppEngine, and I'm lost in so many choices and new skills to learn.
What I'm planning to build is a simple website where users can post short messages and vote on them, which requires a simple but dynamic front page, login/cookie handling, Reddit-like post voting/aging, and some data storage.
Maybe the first question is which framework I should use. I heard Flask is good for beginners learning web programming, and webapp2 is easy to start with since it's integrated into GAE by default. I've looked at Django as well; it looks very powerful, but I couldn't decide.
Since my idea is largely based on a concise but dynamic front page, I guess something AJAX-based is a must, but I have no clue where to start. All those AJAX, jQuery, and ProtoRPC options are confusing. Which technologies should I use, and where can I find good tutorials?
I am also looking for suggestions on potential challenges and anything I should learn to achieve my goal. Thanks!
Since your project is inspired by reddit, the web development course with Steve Huffman (the technical founder of reddit) will be extremely helpful for you.
https://www.udacity.com/course/cs253 - it's free if you just watch the courseware. He even explains their aging algorithms in the end.
This course covers the back-end side of building a Python application with the default webapp2 framework on App Engine. He doesn't cover the front end beyond the basics (HTML forms and tables, stuff like that).
Now, jQuery is a JavaScript library used by nearly all dynamic websites. It is a convenient way to work with the DOM on the fly. Everything you can do with jQuery you can do with plain JavaScript; it's just that jQuery is infinitely easier to work with.
This library is used on the front-end, and it doesn't matter what backend you choose. It is extremely simple and powerful and you can learn the basics at the free codeschool course try.jquery.com.
Basically, if you want something to happen on the page dynamically (the arrow becomes red once the user clicked on it), you use jQuery.
AJAX is asynchronous communication with the server; it can be done with plain JavaScript, but jQuery provides a very convenient wrapper for it. Use case: the user clicks the arrow, you paint it red with jQuery, you increment the vote counter (again with jQuery), and now you need to send the upvote to the server without reloading the page. For this you make a jQuery.ajax() call and pass the user data as a param.
So to wrap it up: you need to write javascript to make a dynamic page and jQuery is the most common library that helps you with this. You need AJAX to get and post data to the server without page refreshes, this is implemented in jQuery. You can use jQuery with any back-end framework that you choose. Start with the simple jQuery tutorial, then read about $.ajax call and it will be clear for you.

Testing links in a web app

I need to test links inside a web application. I have looked at a couple tools (Xenu, various browser plugins, link-checker(ruby)). Nothing quite fits my needs which I will detail below.
I need to get past a login form
test needs to be rerun for different types of users (multiple sets of login credentials)
would like to automate this under a ci server (Jenkins)
the ability to spider the site
Does anyone have any ideas? Bonus if I can use Ruby to do this!
What you are asking for is beyond most of the test tools once you throw in the ability to spider the site.
That final requirement pushes you into the realm of hand-coding something. Using the Mechanize gem you could do all those things, but you'll have to code a lot of the site navigation yourself.
Mechanize uses Nokogiri internally, so it's easy to grab all links in a page, which you could store in a database to be checked by a different thread, or some subsequent code. That said, writing a spider is not hard if you're the owner of the pages you're hitting, because you can be pretty brutal about accessing the server and let the code run at full speed without worrying about being banned for excessive bandwidth use.
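A rough sketch of that approach with Mechanize: log in once, spider internal links, and record broken ones. The URLs, form field names, and credentials are hypothetical, and you'd run this once per set of credentials:

    require 'mechanize'
    require 'set'

    agent = Mechanize.new
    login = agent.get('https://app.example.com/login')   # hypothetical URL
    form  = login.forms.first
    form['username'] = 'qa_user'                         # hypothetical field names
    form['password'] = 'secret'
    agent.submit(form)

    queue, seen, broken = ['https://app.example.com/'], Set.new, []
    until queue.empty?
      url = queue.shift
      next unless seen.add?(url)
      begin
        page = agent.get(url)
      rescue Mechanize::ResponseCodeError => e
        broken << [url, e.response_code]
        next
      end
      next unless page.is_a?(Mechanize::Page)            # skip PDFs, images, etc.
      page.links.each do |link|
        next unless link.uri
        href = (page.uri + link.uri).to_s rescue next
        queue << href if href.start_with?('https://app.example.com')
      end
    end
    p broken   # e.g. feed this into a Jenkins-friendly report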

AngularJS / AJAX app and search engine crawlers

I've got a web app which heavily uses AngularJS / AJAX and I'd like it to be crawlable by Google and other search engines. My understanding is that I need to do something special to make it work, as described here: https://developers.google.com/webmasters/ajax-crawling
Unfortunately, that looks quite nasty and I'd rather not introduce the hash tags. What I'd like to do is to serve a static page to Googlebot (based on the User-Agent), either directly or by sending it a 302 redirect. That way, the web app can be the same, and the whole Googlebot workaround is nicely isolated until it is no longer necessary.
My worry is that Google may mistakenly assume that I'm trying to trick Googlebot, while my goal is to help it. What do you guys think about this approach, and what would you recommend?
Recently I came upon this excellent post from yearofmoo, explaining in detail how to make your Angular app SEO-friendly. In essence, when bots see a URI with a hashbang they know it's an AJAXed page and will try to reach the same URI by replacing '#!' with '?_escaped_fragment_='. This alternative URI tells bots to expect a definitive static version of the page they were accessing.
Of course, to achieve this you'd have to introduce hashbangs into your URIs. I don't see why you are trying to avoid them. Isn't Gmail using hashbangs?
Yeah, unfortunately, if you want to be indexed you have to adhere to the scheme :( If you're running a Ruby app, there's a gem that implements the crawling scheme for any Rack app...
gem install google_ajax_crawler
writeup of how to use it is at http://thecodeabode.blogspot.com.au/2013/03/backbonejs-and-seo-google-ajax-crawling.html, source code at https://github.com/benkitzelman/google-ajax-crawler
Have a look at these links and it will give you a good direction:
Set up your own Prerender service using Prerender.io open source code:
https://prerender.io/
Use a different existing service such as BromBone, Seo.js or SEO4AJAX:
http://www.brombone.com/
http://getseojs.com/
http://www.seo4ajax.com/
Create your own service for rendering and serving snapshots to search engines. Read this article. It will give you the big picture:
http://scotch.io/tutorials/javascript/angularjs-seo-with-prerender-io
As of May 2014, Googlebot executes JavaScript. Check Webmaster Tools to see how Google sees your site.
http://googlewebmastercentral.blogspot.no/2014/05/understanding-web-pages-better.html
Edit: Note that this does not mean other crawlers (Bing, Facebook, etc.) will execute Javascript. You may still need to take additional steps to ensure that these crawlers can see your site.

Pros and Cons of an all Ajax Site? [closed]

So I actually saw a full ajax site somewhere (I forget where) and thought it would be something new and fun to try. I used an old site I had built and put it on a new server. With a little bit of jquery and ajax, I was able to make the entire site work on one page load.
My question is, what are some pros and (more likely) cons to this method?
Please note: the site works through a semi-clever linking function. Everything works perfectly fine if the user doesn't have JavaScript enabled; the newly requested page loads like it would on any other website.
More detail: say the user loads the homepage of the site, then logs in. When they log in, the login box fades and reappears with user info. Other content on the page loads as necessary upon logging in. If they click a link, let's say "Articles", one column on the homepage slides up and slides back down with the articles. If they click Home, the articles slide up and the homepage content slides back down. Things like posting comments, viewing profiles, voting, etc. are all done through AJAX.
Is this a bad method of web design? If so, why?
I am open to all answers/opinions.
IMO, this isn't "bad" or "good". That depends completely on whether or not the website fulfills the requirements. Oftentimes, developers working on AJAX-only sites tend to miss the whole negative SEO impact issue. However, if the site is developed to support progressive enhancement (or graceful degradation depending on your point of view), which it sounds like you have, then you're good. Only things to prepare for are times when the AJAX call can't complete as expected (make sure you're dealing with timeouts, broken links, etc) so the user doesn't get stuck staring at a loading icon. (The kind of stuff you'd have to deal with in any application, really.)
There are plenty of single-page websites out there using heavy JS and AJAX for the UI and they are great. Specifically, I know of portfolio sites for web designers and web app development teams that use this approach. Oftentimes, the app feels a bit like a flash app, but without the need for a special plugin.
"Is this a bad method of web design? If so, why?"
Certainly not. In fact, making web pages behave more like desktop applications, whilst remaining functional to ALL users, is the holy grail of web design.
I say, as long as you consider ALL your users, i.e. mobile/text-only/low bandwidth/small screen sizes, you will be fine. Too many developers build only for their huge 19" screens and 10 Mbps connections, so users get left behind through almost no fault of their own.
It depends on the user
This relates closely to UX, IMHO, though of course it's on-topic for programming solutions.
All-AJAX is often called "managing state" 12 years after this Question was asked.
From my experience in:
Creating a platform for API plugins
Creating two of my own CMS web apps for different purposes
Managing many different WordPress.org sites for different purposes
Managing my own cloud servers for both PHP-AJAX and Node.js doing these calls
...it depends on what is most efficient for users.
Consider these scenarios:
Will users be clicking around this website all day long or for at least an hour adjusting many different options and <form> inputs?
Or will many users visit briefly to perform just a handful of quick tasks?
State-managed / all-AJAX is by far best for scenario 1, with Facebook and Gmail as prime examples.
Whole-page loads are more efficient for scenario 2, like blogs, especially with pages linked directly from search results. That might apply to webstores like Amazon, maybe, where users search Google to find one or two products, then leave.
Philosophically, I've heard that the difference is about the number of users and traffic, but I don't quite agree. It's more about how much clicking and <form> sending the primary target user will be doing.
