Nokogiri XML Parsing with Ruby Sinatra

I'm having difficulty selecting a URL within an XML document using Nokogiri. I have tried CSS selectors, which work fine except for markup inside the child element. I think this is because the angle brackets < and > are written as &lt; and &gt;. Is there a way around this?
&lt;br /&gt;&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://url.com/"&gt;https://url.com/&lt;/a&gt;

require 'cgi'
CGI.unescapeHTML('&lt;br /&gt;&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="https://url.com/"&gt;https://url.com/&lt;/a&gt;')
gives
<br /><strong>URL:</strong> <a href="https://url.com/">https://url.com/</a>
Maybe you just need to unescape your HTML first.

Related

matching <div></div> tag with regular expression in ruby

Could anyone tell me how I can match from the start of a <div> tag to the end of its </div> tag with a regular expression in Ruby?
For example, let's say I have:
<div>
<p>test content</p>
</div>
So far I have this:
< div [^>]* > [^<]*<\/div>
but it doesn't seem to work.
Nokogiri is great but, imho, there are situations when it cannot be used.
For your simple case you can use this:
puts str.scan(/<div>(.*)<\/div>/im).flatten.first
<p>test content</p>
To match the <div> when it's all on one line, use:
/<div[^>]*>/
But, that will break on any markup with a new-line inside the tag. It'll also break if there is whitespace between < and div, which there could be.
Eventually, after you've added in all the extra checks for the possible ways a tag can be written you'll want to consider a better way, which would be to use a parser, like Nokogiri, which makes working with HTML and XML much easier.
For instance, since you're trying to tear apart the HTML:
<div>
<p>test content</p>
</div>
it's pretty easy to guess you really want to get to "test content". What if the HTML changed to:
<div><p>test content</p></div>
or worse:
<div
><p>
test
content
</div>
A browser won't care, nor will a good parser, but a regex will get upset and require rework.
require 'nokogiri'
require 'pp'
doc = Nokogiri.HTML(<<EOT)
<div
><p>
test
content
</div>
EOT
pp doc.at('p').text.strip.gsub(/\s+/, ' ')
# => "test content"
That's why we recommend parsers.
An HTML parser such as Nokogiri would probably be a better option than using a Regex as PinnyM pointed out.
Here is a tutorial on the Nokogiri page that describes how to search an HTML/XML document.
This stackoverflow question demonstrates something similar to what you want to accomplish using CSS selectors. Perhaps something like that would work for you.

Nokogiri/Mechanize xpath locator breaks when there is a stray start tag

I loaded a page using Mechanize:
url = 'http://www.blah.com'
agent = Mechanize.new
page = agent.get(url)
and tried to access an element using an XPath selector:
found = page.at('/html/body/table')
It returns nil because the HTML, which is out of my control, has an opening tag where it shouldn't be:
<html>
<body>
<tr>
<table>
. . .
The "stray start tag," as Firefox calls it, is ignored when the browser renders the page in real life (and Firefox gives me xpaths that ignore it), but Nokogiri can't see anything past that extra <tr>.
Is there any way to clean the HTML of hanging tags like this?
In your example it would be:
page.at '/html/body/tr/table'
But maybe it makes more sense to just do:
page.at 'table'
Use a less brittle XPath query?
found = page.at('//table')
You can clean this using Nokogiri easily:
require 'nokogiri'
html = '<html><body><tr><table><tr><td>foo</td></tr></table></tr></body></html>'
doc = Nokogiri::HTML(html)
inner_table = doc.at('//body/tr/table')
if inner_table
  doc.at('body tr').replace(inner_table)
end
puts doc.to_html
With the result being:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><table><tr><td>foo</td></tr></table></body></html>
If your HTML is more complex, then find some sort of marker similar to the <body><tr><table> node-chain, and substitute it into the code above.
Note that I'm mixing both XPath and CSS accessors. I prefer CSS for their readability, but sometimes XPath makes it easier to get at something or is more self-documenting.
Also notice that I'm using both XPath and CSS with Nokogiri's at method. Though Nokogiri supports at, at_css, and at_xpath, I rely on at unless I need to explicitly tell Nokogiri whether the accessor is CSS or XPath. It's a convenience thing. The same applies to Nokogiri's search method.

How to get Text value of <legend> tag using xpath in ruby watir.(using IE)

I have the following code in my IE web page. I want the text value of the <legend> tag (that is, "ABCD:"). I am using Ruby Watir for that.
<fieldset>
<legend class="fieldset">ABCD:</legend>
<fieldset>
I have tried the code below, but I don't know why it's not working; it gives an error (undefined method `text' for nil:NilClass):
ie.element_by_xpath("//legend[contains(@class, 'fieldset')]/").text
Is there any other way, or is there anything wrong in my code?
Is that the only time the class of 'fieldset' is used on the page?
The list of supported elements shows unknown for Watir and supported for Watir-Webdriver for the legend tag.
Have you tried using Watir-Webdriver and code along these lines?
puts browser.legend(:class => 'fieldset').text
That's cleaner, easier to read, and will likely be faster. Only resort to XPath if nothing else works.

Convert HTML to plain text and maintain structure/formatting, with ruby

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple of regexes that get me 80% of the way there, but figured there might be some existing solutions with more intelligence.
First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is Nokogiri is returning the text nodes, which are basically the whitespace surrounding the tags, along with the text contained in the tags. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.
The problem is when you compare the output of a parser, or any means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content: remove extraneous line breaks such as "\n" and "\r", then replace <br> tags with line breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of the tutorials.
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.

how to get xpath of text between <br> or <br />?

</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Assuming the above, how is it possible to use XPath to grab each fruit? It must use XPath of some sort.
Should I use substring-after(following-sibling...)?
EDIT: I am using Nokogiri parser.
Well, you could use "//br/text()", but that will return all the text-nodes inside the <br> tags. But since the above isn't well-formed xml I'm not sure how you are going to use xpath on it. Regex is usually a poor choice for html, but there are html (not xhtml) parsers available. I hesitate to suggest one for ruby simply because that isn't "my area" and I'd just be googling...
Try the following, which gets all text siblings of <br> tags as an array of strings with leading and trailing whitespace stripped:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(DATA)
# Collect every text node that sits before or after a <br> as its sibling.
fruits = doc.xpath('//br/following-sibling::text() | //br/preceding-sibling::text()').map do |fruit|
  fruit.to_s.strip
end
puts fruits
__END__
</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Is this what you want?
There are several issues here:
XPath works on XML - you have HTML which is not XML (basically, the tags don't match so an XML parser will throw an exception when you give it that text)
XPath normally also works by finding the attributes inside tags. Seeing as your <br> tags don't actually contain the text, they're just in-between it, this will also prove difficult
Because of this, what you probably want to do is use XPath (or similar) to get the contents of the div, and then split the string based on <br> occurrences.
As you've tagged this question with ruby, I'd suggest looking into hpricot, as it's a really nice and fast HTML (and XML) parsing library, which should be much more useful than mucking around with XPath
