Letting Nokogiri decide whether to use #fragment or #parse - ruby

I have a piece of HTML that I would like to parse with Nokogiri, but I do not know whether it is a full HTML document (with DOCTYPE, etc) or a fragment (e.g. just a div with some elements in it).
This makes a difference for Nokogiri, because it should use #fragment for parsing fragments but #parse for parsing full documents.
Is there a way to determine whether a given piece of text is a fragment or a full HTML document?
Denis

Depends on how trashed your page is, but
/\A\s*(?:<!DOCTYPE|<html)/i
should work in most cases. (\A anchors at the very start of the string, \s* allows leading whitespace, the non-capturing group keeps both markers anchored, and /i tolerates lowercase doctypes.)

The simplest way would be to look for the mandatory <html> tag, using for instance the regular expression /<html[\s>]/ (allowing for attributes).
Is this sufficient to solve your problem?
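
Putting those suggestions together, here is a minimal Ruby sketch. The helper name parse_html_smart is made up for illustration, and the regex is only a heuristic, not a guarantee:

require 'nokogiri'

# Hypothetical helper: picks #parse vs #fragment based on a heuristic
# check for a DOCTYPE or <html> tag at the start of the string.
def parse_html_smart(html)
  if html =~ /\A\s*(?:<!DOCTYPE|<html[\s>])/i
    Nokogiri::HTML.parse(html)
  else
    Nokogiri::HTML.fragment(html)
  end
end

full = parse_html_smart("<!DOCTYPE html><html><body><p>hi</p></body></html>")
frag = parse_html_smart("<div><p>hi</p></div>")
puts full.class  # => Nokogiri::HTML::Document (HTML4::Document on newer Nokogiri)
puts frag.class  # => Nokogiri::HTML::DocumentFragment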

Related

Get the inner XML using XPath?

This is my XML
<my_xml>
<record>
<p>hello <b>world</b> this is some html</p>
</record>
</my_xml>
Can I use XPath to return the following?
<p>hello <b>world</b> this is some html</p>
my_xml/record/child::*
child::* selects all element children of the context node; see the XPath specification on the child axis for details.
The quick answer is, no. You can't accomplish this with XPath, but, once you select the parent node (i.e. "record" in your example), you should be able to manipulate it in whichever language you are using to parse the XML. Unfortunately, it may not be "easy".
It sounds like you would want something like the innerHTML property, but for XML DOM instead of the HTML DOM. Unfortunately, nothing like this exists for the XML DOM. If you don't care about the nodes themselves, you could use the textContent property; in the case of your example, you would get "hello world this is some html", which doesn't seem to be what you want.
Check out this similar question, which includes a parsing algorithm in Java. It seems that you will need to write a similar algorithm in whichever language you're using to parse the XML.
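
For instance, in Ruby with Nokogiri, the select-then-serialize approach looks like this. This is just a sketch; whether to_xml or to_s suits you better depends on your needs:

require 'nokogiri'

doc = Nokogiri::XML(<<~XML)
  <my_xml>
  <record>
  <p>hello <b>world</b> this is some html</p>
  </record>
  </my_xml>
XML

# Select the parent node with XPath, then serialize its children:
record = doc.at_xpath('/my_xml/record')
puts record.children.to_xml.strip
# => <p>hello <b>world</b> this is some html</p>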
For anyone looking for this in the future: this IS very much possible to do using a dot, which returns the entire node content as text (at least it does in MS SQL Server's XPath support).
'(/my_xml/record/.)[1]'

Getting HTML table values using XPath in Nokogiri?

I'm trying to get some values from a table using the XPath of this table but it only returns [] (empty):
require 'nokogiri'
require 'open-uri'
url = "http://riopretrans.com.br/linhas.php?ln=106"
doc = Nokogiri::HTML(open(url))
doc.xpath("html/body/table[1]/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/div/table[1]/tbody/tr[3]/td/div/div/center/font/table").each do |lines|
puts lines.content
end
I found the table's XPath using Firebug so I think it's correct.
Can anyone help me?
Remove tbody/ from your XPath.
The tbody tag is part of the HTML spec for tables, but it's rarely actually written in a page's HTML. Browsers insert it into the DOM even though it isn't in the source; Firebug then sees it and shows it to you, so you assume it must be in the page.
Even using "view source" can confuse you: you expect it to be accurate, but the browser has already munged the content to include tbody, so, basically, it's lying to you.
You can confirm this by looking at the HTML that Nokogiri is getting. Use puts doc.to_html['tbody'] and see if you get "tbody" or nil.
...Because in the HTML file all of them were specified (written by the programmer)
If you are positive they actually belong there, because they exist in the HTML source, then you'll need to take apart your XPath. Start with a broad path, and slowly add to it to narrow down your search.
The server is unreachable for me right now, so I can't confirm that, or dig into what the hierarchy should be, and show an example. (That's why actually giving us REAL HTML in your question is SO much better than a link which might not work.)
An alternative is to use XPath's // (search anywhere) with a less restrictive path, or CSS selectors. Either way, actually examine the HTML instead of relying on Firebug's XPath, and determine what "landmarks" you can use in the source to navigate to your desired table. Today's HTML is chock-full of id and class attributes, or a particular series of tags, that act as a fingerprint for the table you want. Search for the minimum needed to pinpoint that table.
If the table is something like <table id="foo">, then use doc.at('table#foo'). If it's in a <div class="bar"><table>, use doc.at('div.bar table'). In any case, use the smallest accessor necessary to get the job done; that will improve your chances of success if anything in the HTML changes in the future.
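
As a sketch of that debugging workflow: the real page is unreachable, so the id and class selectors below are placeholders for whatever landmarks the actual table carries:

require 'nokogiri'
require 'open-uri'

# URI.open on newer Rubies (plain open on older ones):
doc = Nokogiri::HTML(URI.open("http://riopretrans.com.br/linhas.php?ln=106"))

# Confirm whether tbody is really in the source Nokogiri sees:
puts doc.to_html['tbody'].inspect   # "tbody" or nil

# Hypothetical landmarks standing in for the table's real id/class:
table = doc.at('table#horarios') || doc.at('div.linhas table')

# Or a broader search-anywhere XPath without the tbody segments:
table ||= doc.at_xpath('//center/font/table')

puts table.content if table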

facing issue to find xpath expression

My XPath '//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']' is working fine, but after this tag there is a '#document' node, and after that '#document' there is an html tag, so when I extend the XPath expression to '//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/#document/html', it throws an exception as follows:
Caused by: class org.jaxen.saxpath.XPathSyntaxException:
//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/#document:
70: Expected one of '.', '..', '@', '*', QName.
So please guide me how to write XPath for this.
Thanks,
Dhananjay
From what I can gather, XPath does not descend into iframes. You see, XPath expressions are tied to a particular XML document, such as an HTML document,[1] that they can be evaluated against. In the browser, an iframe counts as a separate document. The <iframe> node itself is part of the parent document, but it is merely a pointer to another document (the iframe's contents), which is completely separate.
That seems to be the gist of this email chain, and it seems to fall naturally out of the fact that XPath expressions are evaluated by calling document.evaluate (that is, a method of a particular document object), as implemented in Firefox. This suggests that the overlap between the various specs defining iframes and XPath excludes traversing that document boundary in a single XPath expression, or at least that seems to be Mozilla's interpretation.
But take note that all of this is guesswork based on Firefox's particular implementation of the XPath specification. This limitation may or may not apply to other browsers, but I suspect that it does.
It also seems to explain why Selenium requires you to switch context from one document (the parent HTML page) to another (the iframe itself) in order to execute XPath expressions against it, as hinted at by the solution posted by @singaravelan, and others.
[1] But only if the HTML document is magical enough! (Not all HTML documents are well-formed XML: browsers are much more lenient than XML parsers can be; cf. @MathiasMüller's comment.)
You haven't shown your source XML, but one thing we know for sure is that it doesn't contain an element called "#document", because that isn't a legal element name. For the same reason, you can't request an element called "#document" in your XPath expression.
You can use a different XPath that bypasses the #document node by using the descendant axis.
For example:
//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/descendant::*[1]
or something like that. It depends on what you want from the inner HTML.
First, thanks for raising this question; I was facing the same problem.
The following line solved it in my case:
driver.SwitchTo().Frame(driver.FindElement(By.Name("fraToc")));
Thanks.
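
For reference, the same frame switch in Ruby's selenium-webdriver looks roughly like this (assuming the frame is named fraToc, as in the C# snippet above):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox

# Switch the driver's context into the iframe before running XPath
# queries against the iframe's own document:
driver.switch_to.frame(driver.find_element(name: 'fraToc'))

# ...locate elements inside the iframe here...

# Switch back out to the parent page when done:
driver.switch_to.default_content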

Using a modified Nokogiri to parse Wikitext?

Apologies for the length of this question, it's more of a "is this possible" than "how do I do this".
My objective is to remove everything but plain text from Wikipedia markup -- tables, templates, formatting -- whether it appears as wikitext markup (e.g. ''bold text'') or as HTML (<b>bold text</b>).
Wikipedia text is a mix of custom tags: templates {{ ... }}, tables {| ... |}, links [[ ... ]] and HTML elements. Parsing it is kind of a nightmare. You can't use regular expressions because the tags can be nested, and it can contain HTML so almost anything is possible. Some of the text within the HTML I'd want to keep (stuff within bold text), but other things like tables would need to be stripped entirely.
I thought about re-purposing an XML parser like Nokogiri, adding {{/}} as alternatives to <x>/</x>.
Does anyone who knows Nokogiri (or another Ruby XML parser) know if this is possible or even a good idea?
My alternative is to repurpose an existing parser like WikiCloth for the wiki markup, and then try to remove any leftover HTML via another method.
This sounds like a good idea. However, it would not be possible for you to 'patch' Nokogiri, "adding {{/}} as alternatives to <x>/</x>". This is because the bulk of the work done by Nokogiri—parsing and XPath and generating the string representation of a DOM—is actually done by libxml2 in the back end. You'd have to patch and recompile libxml2 (and then rebuild Nokogiri against your new version)…but at that point I have no idea how Nokogiri would behave.
You might have better luck trying to patch REXML, since that is written in pure Ruby.
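
If you go the route mentioned in the question (WikiCloth for the wikitext, then strip leftover HTML), the stripping stage is straightforward with stock Nokogiri. A minimal sketch, where the input HTML is a made-up stand-in for whatever the wiki renderer emits:

require 'nokogiri'

# Stand-in for HTML produced by a wiki renderer such as WikiCloth:
rendered = "<p>hello <b>world</b></p><table><tr><td>a table to drop</td></tr></table>"

frag = Nokogiri::HTML.fragment(rendered)

# Remove elements whose contents should be stripped entirely (tables, etc.):
frag.css('table').each(&:remove)

# #text flattens what's left to plain text, keeping the contents of
# inline formatting like <b>:
puts frag.text   # => "hello world"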

Some websites not allowed to be parsed by xpath?

I am trying to parse one element from a website that is inside of a table. This is the exact xpath expression that I use:
[xpathParser search:@"/table[1]/tr[2]/td[1]"];
However, when I run the program, my string comes up empty. I'm wondering whether the site is blocking me from parsing or whether my expression is incorrect. If it helps, this is the site, and the piece I am trying to parse is the element Atlantic.
http://cluster.leaguestat.com/download.php?client_code=ahl&file_path=daily-report/daily-report.html
There are several 'Atlantic' sections on the page, so I'm not sure which element you mean. Your XPath expression might not be correct, as tr is not a direct descendant of table (there is a tbody in between). You might want to try //table/tbody/tr[2]/td[1], and use the XPath Checker Firefox plugin to test expressions.
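
If you want to sanity-check the expression outside your app, a quick probe in Ruby with Nokogiri can settle the tbody question. This is just a testing sketch; the page layout comes from the question's link, which may have changed:

require 'nokogiri'
require 'open-uri'

url = "http://cluster.leaguestat.com/download.php?client_code=ahl&file_path=daily-report/daily-report.html"
doc = Nokogiri::HTML(URI.open(url))

# Try the expression with and without tbody; which one matches depends
# on whether the source HTML actually contains tbody tags:
["//table/tbody/tr[2]/td[1]", "//table/tr[2]/td[1]"].each do |xp|
  node = doc.at_xpath(xp)
  puts "#{xp} => #{node ? node.text.strip : 'no match'}"
end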
