Parse XHTML using Ruby - ruby

Is there any way I can parse a remote HTML page in Ruby, preferably using jQuery-like selectors?
For example, I could select all the divs with a specific class and get the content of all those elements in an array.
I was trying to use regular expressions for this, but I think using an XML parser would be better.

I found that Hpricot offers something very similar.

Related

Letting Nokogiri decide whether to use #fragment or #parse

I have a piece of HTML that I would like to parse with Nokogiri, but I do not know whether it is a full HTML document (with DOCTYPE, etc) or a fragment (e.g. just a div with some elements in it).
This makes a difference for Nokogiri, because it should use #fragment for parsing fragments but #parse for parsing full documents.
Is there a way to determine whether a given piece of text is a fragment or a full HTML document?
Denis
Depends on how trashed your page is, but
/\A\s*(?:<!DOCTYPE|<html)/i
should work in most cases.
The simplest way would be to look for the mandatory <html> tag, using, for instance, the regular expression /<html[\s>]/ (which allows attributes).
Is this sufficient to solve your problem?
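As a rough sketch of how the regex check could drive the choice of parser method: the helper names `full_html_document?` and `parser_method_for` are hypothetical, and the pattern is only a heuristic (it does not account for leading comments, a byte order mark, or an XML declaration).

```ruby
# Heuristic: treat the text as a full document if, after optional
# whitespace, it starts with a DOCTYPE or an <html> tag.
FULL_DOCUMENT = /\A\s*(?:<!DOCTYPE\b|<html[\s>])/i

# Hypothetical helper: true when the text looks like a whole document.
def full_html_document?(text)
  FULL_DOCUMENT.match?(text)
end

# Hypothetical helper: name of the Nokogiri method that fits the text.
def parser_method_for(text)
  full_html_document?(text) ? :parse : :fragment
end

# With Nokogiri loaded, the result could be used along these lines:
#   Nokogiri::HTML.public_send(parser_method_for(html), html)
puts parser_method_for("<!DOCTYPE html><html><body></body></html>")
puts parser_method_for("<div><p>just a fragment</p></div>")
```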

XPath queries using HtmlAgilityPack fail to select nodes with self-closing tags

I'm trying to query all input nodes. All of the nodes that are not self-closing are being returned fine, but the nodes that are self-closing are not. Is there a way to address this that doesn't require me to change the HTML?
Thanks!
This is the default behavior. If you want to change it, you need to play with the ElementFlags collection, and for example, just remove INPUT from it, just like I explained for OPTION on a similar question here on SO: XHTML Parsing with HTMLAgilityPack

Is it possible to alter a php file using XPath?

I'm unsure about this. Would having PHP (or, I guess, any template language like Django's or Mako or whatever) inside an HTML file prevent me from making changes to it with XPath?
I'm very new to XPath. I would think that you could not, but as I said, I'm unsure.
XPath is a query language. You use it to query XML content, not change it.
You can use XPath in conjunction with other technologies (XSLT is the first one that comes to mind) in order to query your XML and then use the results of these queries to transform your XML.
XPath doesn't change the XML document.
Use XSLT or any other XPath-hosting language that can produce a new XML document.

XPATH remove attribute

Hi, does anyone know how to remove an attribute using XPath? In particular, the rel attribute and its text from a link, i.e. <a href='http://google.com' rel='some text'>Link</a>, and I want to remove rel='some text'.
There will be multiple links in the html i am parsing.
You can select items using XPath, but that's all it can do: it is a query language.
You need to use XSLT or an XML parser in order to remove attributes/elements.
As pointed out by Oded, XPath merely identifies XML nodes. To remove/edit XML, you need some additional tooling.
One solution is the Ant-based plugin XMLTask (disclaimer - I wrote this). It provides a simple mechanism to read an XML file, identify parts of that using XPath, and change it (including removing nodes).
e.g.
<remove path="web/servlet/context[@id='redundant']"/>
Have you already tried using JavaScript for this, if that is applicable in your scenario?
// Strip the rel attribute from every link in the page
var allLinks = document.getElementsByTagName("a");
for (var i = 0; i < allLinks.length; i++)
{
    allLinks[i].removeAttribute("rel");
}
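The same select-then-modify idea can be sketched in Ruby with REXML: XPath only picks the nodes, and the host language does the actual removal. This assumes the markup is well-formed XML (REXML will not accept tag soup); the helper name `remove_attribute` and the sample markup are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper: delete attribute `name` from every element
# selected by `xpath`, then return the serialized document.
def remove_attribute(xml, xpath, name)
  doc = REXML::Document.new(xml)
  # XPath only selects the elements; delete_attribute does the edit.
  REXML::XPath.each(doc, xpath) { |element| element.delete_attribute(name) }
  doc.to_s
end

# Hypothetical sample input, well-formed so that REXML can parse it.
xml = "<p><a href='http://google.com' rel='some text'>Link</a> " \
      "<a href='http://example.com'>Other</a></p>"

cleaned = remove_attribute(xml, "//a[@rel]", "rel")
puts cleaned
```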

Ruby XMLParsing Exception

I get a ParseException every time I try to parse the data from an HTTP get_response in Ruby. The exception is caused by the presence of '&' in the data. How do I solve this?
Illegal character '&' in raw string (REXML::ParseException)
Is the data you're passing to the parser XML? Do other parsers complain about it?
Check to make sure that the data that you're trying to parse is well-formed XML. If you are trying to pass it HTML or RSS from the web, then it almost certainly isn't well-formed XML (HTML is not XML, though XHTML might be, and while RSS is supposed to be XML, there are lots of bad RSS generators out there that generate RSS that is not well-formed or is invalid).
If you need to parse HTML, try Hpricot. If you need to parse RSS, use the built-in RSS parser; there are some examples here.
If you're trying to parse HTML consider using Nokogiri.
Nokogiri::HTML("<html>...</html>")
You can also try Nokogiri::XML but I believe that requires valid markup.
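If the data is almost XML and the only problem is unescaped ampersands, one possible workaround is to escape them before handing the string to REXML. This is just a sketch under that assumption: the helper name `escape_bare_ampersands` and the sample data are hypothetical, and the regex is a heuristic that ignores CDATA sections and comments.

```ruby
require "rexml/document"

# Matches an "&" that does not start an entity or character reference
# such as &amp; &#38; or &#x26;.
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#\d+|#x\h+);)/

# Hypothetical helper: escape only the ampersands that are not already
# part of a reference, so existing entities are left alone.
def escape_bare_ampersands(text)
  text.gsub(BARE_AMPERSAND, "&amp;")
end

# Hypothetical sample data with one bare "&" and one proper entity.
raw = "<show><title>Tom & Jerry</title><network>AT&amp;T</network></show>"

fixed = escape_bare_ampersands(raw)
doc = REXML::Document.new(fixed)
title = doc.root.elements["title"].text
network = doc.root.elements["network"].text
puts title, network
```

If the input is real-world HTML rather than nearly valid XML, a lenient HTML parser is likely the safer route than patching the string.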
