I really like the way Algolia has approached their multi-programming language API documentation, e.g. https://www.algolia.com/doc/javascript .
Does anyone happen to know what technologies they use to generate it?
The documentation generator we're using is currently an internal tool. We might open-source it at some point, but that would require extra work we don't currently have time for.
It's basically a markdown file with some extra syntax to:
handle multi-language code blocks (it then automatically selects the right one)
handle conditions depending on the current language
handle callouts
handle buttons
The rendering is then hand-made with the help of bootstrap.
Sorry for this question. Can anybody point me to some documentation or an article about the difference between achecker and the W3C validator?
Any info?
thanks
w.
W3C validator checks the markup validity of Web documents in HTML, XHTML, SMIL, MathML, etc. This tool checks the grammar of the document.
achecker is a basic tool to check a website against the WCAG recommendations (Web Content Accessibility Guidelines). As WCAG testing requires a lot of manual testing, and there is currently no good automated accessibility checker, the results of such tools should be treated as indicators only.
I am looking at writing my own, but I am wondering if there are any good web crawlers out there which are written in Ruby.
Short of a full-blown web crawler, any gems that might be helpful in building a web crawler would be useful. I know this part of the question is touched upon in a couple of places, but a list of gems applicable to building a web crawler would be a great resource as well.
I used to write spiders, page scrapers and site analyzers for my job, and still write them periodically to scratch some itch I get.
Ruby has some excellent gems to make it easy:
Nokogiri is my #1 choice for the HTML parser. I used to use Hpricot, but found some sites that made it explode in flames. I switched to Nokogiri afterwards and have been very happy with it. I regularly use it for parsing HTML, RDF/RSS/Atom and XML. Ox looks interesting too, so that might be another candidate, though I find searching the DOM a lot easier than trying to walk through a big hash, such as what is returned by Ox.
OpenURI is good as a simple HTTP client, but it can get in the way when you want to do more complex things or need to have multiple requests firing at once. I'd recommend looking at HTTPClient or Typhoeus with Hydra for modest to heavyweight jobs. Curb is good too, because it uses the cURL library, but the interface isn't as intuitive to me. It's worth looking at though.
Note: OpenURI has some flaws and vulnerabilities that can affect unsuspecting programmers so it's fallen out of favor somewhat. RestClient is a very worthy successor.
You'll need a backing database, and some way to talk to it. This isn't a task for Rails per se, but you could use ActiveRecord, detached from Rails, to talk to the database. I've done that a couple times and it works all right. Instead, I really like Sequel for my ORM. It's very flexible in how it lets you talk to the database, from using straight SQL to using Sequel's ability to programmatically build a query, to modeling the database and using migrations. Once you have the database built, you could use Rails to act as a front-end to the data though.
If you are going to navigate sites in any way beyond simply grabbing pages and following links, you'll want to look at Mechanize. It makes it easy to fill out forms and submit pages. As an added bonus, you can grab the content of a page as a Nokogiri HTML document and parse away using Nokogiri's multitude of tricks.
For massaging/mangling URLs I really like Addressable::URI. It's more full-featured than the built-in URI module. One thing that URI does that's nice is it has the URI#extract method to scan a string for URLs. If that string happened to be the body of a web page it would be an alternate way of locating links, but its downside is you'll also get links to images, videos, ads, etc., and you'll have to filter those out, probably resulting in more work than if you use a parser and look for <a> tags exclusively. For that matter, Mechanize also has the links method which returns all the links in a page, but you'll still have to filter them to determine whether you want to follow or ignore them.
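The filtering step mentioned above can be sketched with nothing but the built-in URI module. This is a minimal illustration, not a definitive implementation: the followable_links helper, the SKIP_EXTENSIONS list and the "same host only" rule are all assumptions made for the example, and the href values are assumed to have been collected already (by a parser, Mechanize's links method, or similar).

```ruby
require "uri"

# Hypothetical list of extensions we assume are not worth crawling.
SKIP_EXTENSIONS = %w[.jpg .jpeg .png .gif .svg .css .js .mp4 .pdf].freeze

# Resolve raw href values against the page URL and keep only
# same-host HTTP(S) links that do not point at obvious media files.
def followable_links(page_url, hrefs)
  base_host = URI.parse(page_url).host
  hrefs.filter_map do |href|
    begin
      uri = URI.join(page_url, href)
    rescue URI::Error
      next # skip anything that cannot be parsed as a URL
    end
    next unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
    next unless uri.host == base_host
    next if SKIP_EXTENSIONS.include?(File.extname(uri.path).downcase)
    uri.fragment = nil # "#section" anchors point at the same page
    uri.to_s
  end.uniq
end
```

Called with a page URL and the raw href strings found on that page, it returns absolute, de-duplicated URLs. Whether to stay on one host, and which extensions to skip, depends entirely on what the crawler is for.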
If you think you'll need to deal with JavaScript-manipulated pages, or pages that get their content dynamically from AJAX, you should look into using one of the WATIR variants. There are flavors for the different browsers on different OSes, such as Firewatir, Safariwatir and Operawatir, so you'll have to figure out what works for you.
You do NOT want to rely on keeping your list of URLs to visit, or visited URLs, in memory. Design a database schema and store that information there. Spend some time up front designing the schema, thinking about what things you'll want to know as you collect links on a site. SQLite3, MySQL and Postgres are all excellent choices, depending on how big you think your database needs will be. One of my site analyzers was custom designed to help us recommend SEO changes for a Fortune 50 company. It ran for over three weeks covering about twenty different sites before we had enough data and stopped it. Imagine what would have happened if we had a power-outage and all that data went in the bit-bucket.
After all that, you'll also want to make your code aware of proper spidering etiquette; see "What are the key considerations when creating a web crawler?"
I am building wombat, a Ruby DSL to crawl web pages and extract content. Check it out on GitHub: https://github.com/felipecsl/wombat
It is still at an early stage but already works for basic use cases. More will be added really soon.
So you want a good Ruby-based web crawler?
Try spider or anemone. Both have solid usage according to RubyGems download counts.
The other answers, so far, are detailed and helpful but they don't have a laser-like focus on the question, which asks for ruby libraries for web crawlers. It would seem that this distinction can get muddled: see my answer to "Crawling vs. Web-Scraping?"
Tin Man's comprehensive list is good but partly outdated for me.
Most websites my customers deal with are heavily AJAX/JavaScript dependent.
I've been using Watir / watir-webdriver / selenium for a few years too, but the overhead of loading a hidden web browser on the backend just to render the DOM isn't viable. On top of that, in all this time they still haven't implemented a usable "browser session reuse" that would let newly executed code reuse a browser already in memory, and they have shot down tickets that might eventually have worked their way up the API layers (referring to https://code.google.com/p/selenium/issues/detail?id=18 ). **
https://rubygems.org/gems/phantomjs
is what we're migrating new projects over to now, so that the necessary data gets rendered without any sort of invisible, Xvfb-backed, memory- and CPU-heavy web browser.
** Alternative approaches also failed to pan out:
how to serialize an object using TCPServer inside?
Can a watir browser object be re-used in a later Ruby process?
If you don't want to write your own, then use any ordinary web crawler. There are dozens out there.
If you do want to write your own, then write your own. A web crawler isn't exactly a complicated activity; it consists of:
1. Downloading a website.
2. Locating URLs in that website, filtered however you dang well please.
3. For each URL in that website, repeat step 1.
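Those three steps can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: crawl and extract_links are hypothetical names, the fetch argument is any callable you supply that returns a page body (or nil), and the regex is a crude stand-in for a real HTML parser. The visited list is kept in memory purely for brevity; a long-running crawler would want it in a database.

```ruby
require "set"
require "uri"

# Step 2: pull href values out of the markup and make them absolute.
# The regex is a deliberately crude stand-in for a real HTML parser.
def extract_links(page_url, html)
  html.scan(/<a\s[^>]*href=["']([^"']+)["']/i).flatten.filter_map do |href|
    begin
      URI.join(page_url, href).to_s
    rescue URI::Error
      nil # drop hrefs that are not parseable URLs
    end
  end
end

# fetch is any callable that takes a URL and returns the page body,
# or nil on failure. Injecting it keeps the loop free of network code.
def crawl(start_url, fetch:, max_pages: 100)
  visited = Set.new
  queue = [start_url]
  until queue.empty? || visited.size >= max_pages
    url = queue.shift
    next if visited.include?(url)
    visited << url
    body = fetch.call(url) # step 1: download
    next if body.nil?
    extract_links(url, body).each do |link| # step 2: locate URLs
      queue << link unless visited.include?(link) # step 3: repeat for each
    end
  end
  visited.to_a
end
```

In real use, fetch would wrap whichever HTTP client you prefer, and the "filtered however you please" part would go between extract_links and the queue.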
Oh, and this seems to be a duplicate of "Web crawler in ruby".
I've played around with Google Sitemaps on a couple sites. The lastmod, changefreq, and priority parameters are pretty cool in theory. But in practice I haven't seen these parameters affect much.
And most of my sites don't have a Google Sitemap and that has worked out fine. Google still crawls the site and finds all of my pages. The old meta robot and robots.txt mechanisms still work when you don't want a page (or directory) to be indexed. And I just leave every other page alone and as long as there's a link to it Google will find it.
So what reasons have you found to write a Google Sitemap? Is it worth it?
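For readers who have not seen one, here is a rough sketch of what generating such a file involves. The build_sitemap and xml_escape helpers are hypothetical names made up for this example; the element names (loc, lastmod, changefreq, priority) and the namespace come from the sitemaps.org protocol, where lastmod is a W3C datetime such as YYYY-MM-DD, changefreq is one of always, hourly, daily, weekly, monthly, yearly or never, and priority ranges from 0.0 to 1.0.

```ruby
XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;", "'" => "&apos;"
}.freeze

# Escape the five characters that are special in XML text.
def xml_escape(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# entries: array of hashes with :loc (required) and optional
# :lastmod, :changefreq and :priority keys.
def build_sitemap(entries)
  urls = entries.map do |entry|
    tags = %i[loc lastmod changefreq priority].filter_map do |name|
      value = entry[name]
      "    <#{name}>#{xml_escape(value)}</#{name}>" unless value.nil?
    end
    "  <url>\n#{tags.join("\n")}\n  </url>"
  end
  [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    *urls,
    "</urlset>"
  ].join("\n") + "\n"
end
```

The optional tags are simply omitted when a hash does not supply them, so a list of bare URLs produces a valid minimal sitemap.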
From the FAQ:
Sitemaps are particularly helpful if:
Your site has dynamic content.
Your site has pages that aren't easily discovered by Googlebot during the crawl process, for example, pages featuring rich AJAX or images.
Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
It also allows you to provide more granular information to Google about the relative importance of pages in your site and how often the spider should come back. And, as mentioned above, if Google deems your site important enough to show sublinks under it in the search results, you can control what appears via the sitemap.
I believe the "special links" in search results are generated from the google sitemap.
What do I mean by "special link"? Search for "apache", below the first result (Apache software foundation) there are two columns of links ("Apache Server", "Tomcat", "FAQ").
I guess it helps Google prioritize its crawl? In practice, I was involved in a project where we used a large, gzipped sitemap, and it helped massively. AFAIK there is also nice integration with Webmaster Tools.
I am also curious about the topic, but does it cost anything to generate a sitemap?
In theory, anything that costs nothing and may have a potential gain, even if very small or very remote, can be defined as "worth it".
In addition, Google says: "Tell us about your pages with Sitemaps: which ones are the most important to you and how often they change. You can also let us know how you would like the URLs we index to appear." (Webmaster Tools)
I don't think that the bold statement above is possible with the traditional mechanisms that search engines use to discover URLs.
I wonder how to get at tags in blog posts (WordPress, Blogger, or Blogspot) programmatically (API, RSS feed, XML, other methods). Preferably a solution usable in Ruby on Rails.
See what the lifestreaming Rails apps such as kakuteru are using. Tagging across multiple Web 2.0 style streams is important to kakuteru, and I think they may employ a number of techniques. They also employ zemanta, which has an API to generate tags from content. You can see zemanta's example of getting tags using Ruby.
If either of those has an API, I'd start by reading its documentation. Outside of that, you can roll your own screen scraper (which is likely to be useless, given the amount of DOM content that is script-generated these days).
Wordpress API
Blogger API
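As one possible route besides the APIs: WordPress RSS 2.0 feeds normally carry each post's tags and categories as category elements inside every item, so the feed itself can be parsed. The sketch below uses REXML, which ships with Ruby, and assumes the feed body has already been downloaded; tags_by_post is a hypothetical helper name. Blogger feeds are Atom rather than RSS and put the label in a term attribute, so they would need a different lookup than the one shown here.

```ruby
require "rexml/document"

# Returns { post title => [category/tag names] } for an RSS 2.0 feed body.
def tags_by_post(rss_xml)
  doc = REXML::Document.new(rss_xml)
  result = {}
  doc.elements.each("rss/channel/item") do |item|
    title_element = item.elements["title"]
    title = title_element ? title_element.text.to_s : ""
    tags = []
    item.elements.each("category") do |category|
      tags << category.text.to_s.strip
    end
    result[title] = tags.reject(&:empty?).uniq
  end
  result
end
```

In a Rails app this could sit behind a small service object that fetches the feed on a schedule and stores the resulting tags alongside the posts.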