How to turn off validation in Nokogiri? - ruby

I need to insert an HTML 5 video tag into several places in an HTML document that is being parsed with Nokogiri.
Since it doesn't support HTML 5 (afaik), it throws an exception, because the document is not valid in terms of HTML 4.0.
Is it possible to switch the validation off?

It would help if you would show some sample code demonstrating the problem, along with the error you are seeing.
Nokogiri should parse HTML fine as it uses a lenient mode for HTML. I switched to Nokogiri several years ago because I had some HTML and RSS feeds that caused Hpricot to explode. Nokogiri would occasionally get mad because a page was full of errors, but at least there were ways to get at it. Rescue the exception, then check your doc.errors to see what Nokogiri thinks the problem is.
Something like this should help:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body>...</body></html>')
puts doc.errors if (doc.errors.any?)
...

Related

Ruby: parsing message from confluence xml macro

I am trying to parse the message that says "this is a test"
<ac:structured-macro ac:name="warning"><ac:rich-text-body><strong>High</strong> This is a test!</ac:rich-text-body></ac:structured-macro>
I am using nokogiri in ruby and was able to parse this much and nothing else. To get this far, my code looks something like this:
xml = Nokogiri::XML(response)
body = xml.at("body").text
alert_body = alert[3]
I have wasted too many hours looking in the confluence rest api documentation and google for just general xml parsing.
The problems are:
There is no body tag in your example XML.
You're dealing with XML-Namespaces so your selector needs to change.
Your XML sample is incomplete since it's missing the line that would define the namespaces, so this is a bit of a hack but should give you an idea what needs to be done:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<foo xmlns:ac="http://www.w3.org/2005/Atom">
<ac:structured-macro ac:name="warning"><ac:rich-text-body><strong>High</strong> This is a test!</ac:rich-text-body></ac:structured-macro>
</foo>
EOT
doc.at('ac|rich-text-body').text # => "High This is a test!"
Namespaces are useful but they can be a major pain in the neck. Nokogiri makes it pretty easy to deal with them, especially when using CSS selectors. Read Nokogiri's "Searching an HTML / XML Document" page's "Namespaces" section for more information.

Getting HTML table values using XPath in Nokogiri?

I'm trying to get some values from a table using the XPath of this table but it only returns [] (empty):
require 'nokogiri'
require 'open-uri'
url = "http://riopretrans.com.br/linhas.php?ln=106"
doc = Nokogiri::HTML(open(url))
doc.xpath("html/body/table[1]/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/div/table[1]/tbody/tr[3]/td/div/div/center/font/table").each do |lines|
puts lines.content
end
I found the table's XPath using Firebug so I think it's correct.
Can anyone help me?
Remove tbody/ from your XPath.
The tbody tag is part of the HTML spec for table tags, but it's rarely actually implemented in the HTML. Some browsers insert it, though it's not in the HTML for the page. Firebug then sees it, which you see, and think it must be so.
Even using "view source" can confuse you, because you expect that to be accurate, but the browser has already munged the content to include "tbody", so, well, basically they're lying to you.
You can confirm this by looking at the HTML that Nokogiri is getting. Use puts doc.to_html['tbody'] and see if you get "tbody" or nil.
...Because in the HTML file all of them were specified (written by the programmer)
If you are positive they actually belong there, because they exist in the HTML source, then you'll need to take apart your XPath. Start with a broad path, and slowly add to it to narrow down your search.
The server is unreachable for me right now, so I can't confirm that, or dig into what the hierarchy should be, and show an example. (That's why actually giving us REAL HTML in your question is SO much better than a link which might not work.)
An alternative is to use XPath's // (search anywhere) with a less restrictive path, or CSS selectors. Either way, actually examine the HTML, instead of relying on Firebug's XPath, and determine what "landmarks" you can use in the source to navigate to your desired table. Today's HTML is chock-full of id and class attributes, or a particular series of tags that act as a fingerprint for the table you want. Search for the minimum needed to pinpoint that table.
If the table is something like <table id="foo">, then use doc.at('table#foo'). If it's in a <div class="bar"><table> use doc.at('div.bar table'). In any case, use the smallest sized accessor necessary to get the job done. That will increase your chances of success if anything in the HTML changes in the future.

How do I write a web scraper in Ruby?

I would like to crawl a popular site (say Quora) that doesn't have an API and get some specific information and dump it into a file - say either a csv, .txt, or .html formatted nicely :)
E.g. return only a list of all the 'Bios' of the Users of Quora that have, listed in their publicly available information, the occupation 'UX designer'.
How would I do that in Ruby ?
I have a moderate enough level of understanding of how Ruby & Rails work. I just completed a Rails app - mainly all written by myself. But I am no guru by any stretch of the imagination.
I understand RegExs, etc.
Your best bet would be to use Mechanize. It can follow links, submit forms, anything you will need, web client-wise. By the way, don't use regexes to parse HTML. Use an HTML parser.
If you want something more high level, try wombat, which is this gem I built on top of Mechanize and Nokogiri. It is able to parse pages and follow links using a really simple and high level DSL.
I know the answer has been accepted, but Hpricot is also very popular for parsing HTML.
All you have to do is take a look at the html source of the pages and try to find a XPath or CSS expression that matches the desired elements, then use something like:
doc.search("//p[@class='posted']")
Mechanize is awesome. If you're looking to learn something new though, you could take a look at Scrubyt: https://github.com/scrubber/scrubyt. It looks like Mechanize + Hpricot. I've never used it, but it seems interesting.
Nokogiri is great, but I find the output messy to work with. I wrote a ruby gem to easily create classes off HTML: https://github.com/jassa/hyper_api
The HyperAPI gem uses Nokogiri to parse HTML with CSS selectors.
E.g.
Post = HyperAPI.new_class do
string title: 'div#title'
string body: 'div#body'
string author: '#details .author'
integer comments_count: '#extra .comment' do
size
end
end
# => Post
post = Post.new(html_string)
# => #<Post title: 'Hi there!', body: 'This blog post will talk about...', author: 'Bob', comments_count: 74>

Parsing an RSS item that has a colon in the tag with Ruby?

I'm trying to parse the info from an RSS feed that has this tag structure:
<dc:subject>foo bar</dc:subject>
using the built in Ruby RSS library. Obviously, doing item.dc:subject is throwing errors, but I can't figure out any way to pull out that info. Is there any way to get this to work? Or is it possible with a different RSS library?
Tags with ':' in them are really XML tags with a namespace. I never had good results using the RSS module because the feed formats often don't meet the specs, causing the module to give up. I highly recommend using Nokogiri to parse the feed, whether it is RDF, RSS or ATOM.
Nokogiri has the ability to use XPath accessors or CSS accessors, and, both support namespaces. The last two lines would be equivalent:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(open('http://somehost.com/rss_feed'))
doc.at('//dc:subject').text
doc.at('dc|subject').text
When dealing with namespaces you'll need to add the declaration to the XPath accessor:
doc.at('//dc:subject', 'dc' => 'link to dc declaration')
See the "Namespaces" section for more info.
Without a URL or a better sample I can't do more, but that should get you pointed in a better direction.
A couple of years ago I wrote a big RSS aggregator for my job using Nokogiri that handled RDF, RSS and ATOM. Ruby's RSS library wasn't up to the task but Nokogiri was awesome.
If you don't want to roll your own, Paul Dix's Feedzirra is a good gem for processing feeds.
The RSS module seems to have the ability to do those XML namespace attributes, i.e. <dc:date> like this:
feed.items.each do |item|
puts "Date: #{item.dc_date}"
end
I think item['dc:subject'] might work.
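A sketch of the accessor approach for the tag from the question, assuming the rss library is available (it ships as a bundled gem with recent Ruby versions) and that its Dublin Core support exposes dc_subject the same way it exposes dc_date. The feed below is invented and kept valid so the parser's default validation accepts it.

```ruby
require 'rss'

# Invented RSS 2.0 feed with a Dublin Core subject on the item.
feed_xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
      <title>Example feed</title>
      <link>http://example.com/</link>
      <description>Made-up feed for illustration</description>
      <item>
        <title>First post</title>
        <link>http://example.com/1</link>
        <dc:subject>foo bar</dc:subject>
      </item>
    </channel>
  </rss>
XML

feed = RSS::Parser.parse(feed_xml)

# The colon in dc:subject becomes an underscore in the accessor name.
subjects = feed.items.map(&:dc_subject)
puts subjects.inspect
```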

Ruby XMLParsing Exception

I get a ParseException every time I try to parse a http get_response data in Ruby. The Exception is because of the presence of '&' in the data. How do I solve this?
Illegal character '&' in raw string (REXML::ParseException)
Is the data you're passing to the parser XML? Do other parsers complain about it?
Check to make sure that the data that you're trying to parse is well-formed XML. If you are trying to pass it HTML or RSS from the web, then it almost certainly isn't well-formed XML (HTML is not XML, though XHTML might be, and while RSS is supposed to be XML, there are lots of bad RSS generators out there that generate RSS that is not well-formed or is invalid).
If you need to parse HTML, try Hpricot. If you need to parse RSS, use the built-in RSS parser; there are some examples here.
If you're trying to parse HTML consider using Nokogiri.
Nokogiri::HTML("<html>...</html>")
You can also try Nokogiri::XML but I believe that requires valid markup.
