Convert HTML to plain text and maintain structure/formatting, with Ruby

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple of regexes that get me 80% of the way there, but figured there might be some existing solutions with more intelligence.

First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is Nokogiri is returning the text nodes, which are basically the whitespace surrounding the tags, along with the text contained in the tags. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.
The problem arises when you compare the output of a parser, or any other means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content: remove extraneous line breaks such as "\n" and "\r", then replace <br> tags with line breaks. There are many questions here on SO explaining how to replace tags with something else. I think the Nokogiri site also has that as one of the tutorials.
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.

Related

Closing anchor tags with HtmlAgilityPack

I am using the HtmlAgilityPack to scrape crummy html and get links, raw text, etc. I'm running into a few pages that have inconsistently closed <a> tags, like this:
<html>
<head></head>
<body>
<a href=...>Here's a great link! <a href=...>Here's another one!</a>
Here's some unrelated text.
</body></html>
HAP parses this, and helpfully closes the open <a> tag, but only at the very end of the document:
<html>
<head></head>
<body>
Here's a great link! <a href="...">Here's another one!
Here's some unrelated text.
</a></body></html>
In practice this means that the InnerText of any unclosed link contains all text from the rest of the page, which gets exciting when parsing a page that may contain thousands of unclosed tags and megabytes of text.
So, how can I make HAP close those tags immediately, ideally putting the close just before the next open so that there is never any overlap for an <a>? I've played around with OptionFixNestedTags and OptionAutoCloseOnEnd with no luck, and I've found advice on how to allow overlap, but I'm drawing a blank on actually fixing it.

Finding partial string within horrible HTML using Nokogiri

Using Nokogiri, I want to fetch the part of the paragraph that comes after the <span> tags.
I am no regex hero, and it is the only thing that I need to discover before I can move forward. The only constant in the list is the | symbol, and the ugly way is to get the whole thing and split and join it I guess. Hopefully, there is a smarter, more elegant way!
<ul>
<li>
<p>
<strong>I don't care about </strong>
<span>|</span>
this I do care about
</p></li> ...
</ul>
If your HTML is that simple, then this will work:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ul>
<li>
<p>
<strong>I don't care about </strong>
<span>|</span>
this I do care about
</p></li> ...
</ul>
EOT
doc.at('p').children.last # => #<Nokogiri::XML::Text:0x3ff1995c5b00 "\nthis I do care about\n">
doc.at('p').children.last.text # => "\nthis I do care about\n"
Parsing HTML and XML is really a matter of looking for landmarks that can be used to find what you want. In this case, <span> is an OK landmark, but getting the content you want from it isn't quite as easy as looking up one level to the <p> tag, grabbing its children, and selecting the last node in that list, which is the text node containing the text you want.
The reason using the <span> tag is not the way I'd go is, if the HTML formatting changes, the number of nodes between <span> and your desired text could change. Intervening text nodes containing "\n" could be introduced for the formatting of the source, which would mess up a simple indexed lookup. To work around that, the code would have to ignore blank nodes and find the one that wasn't blank.
I am no regex hero...
And you shouldn't try to be with HTML or XML. They're too flexible and can confound regular expressions unless you're dealing with extremely trivial searches on very static HTML, which isn't very likely in the real internet unless you're scanning abandoned pages. Instead, learn and rely on decent HTML/XML parsers, that can reduce a page into a DOM, making it easy to search and traverse the markup.

How do I find matching <pre> tags using a regular expression?

I am trying to create a simple blog that has code enclosed in <pre> tags.
I want to display "read more" after the first closing </pre> tag is encountered, thus showing only the first code segment.
I need to display all text, HTML, code up to the first closing </pre> tag.
What I've come up with so far is the following:
/^(.*<\/pre>).*$/m
However, this matches every closing </pre> tag up to the last one encountered.
I thought something like the following would work:
/^(.*<\/pre>{1}).*$/m
It of course does not.
I've been using Rubular.
My solution, thanks to your help:
require 'nokogiri'
module PostsHelper
  def readMore(post)
    doc = Nokogiri::HTML(post.message)
    intro = doc.search("div[class='intro']")
    result = Nokogiri::XML::DocumentFragment.parse(intro)
    result << link_to("Read More", post_path(post))
    result.to_html
  end
end
Basically in my editor for the blog I wrap the blog preview in div class=intro
Thus, only the intro is displayed with read more added on to it.
This is not a job for regular expressions, but for an HTML/XML parser.
Using Nokogiri, this will return all <pre> blocks as HTML, making it easy for you to grab the one you want:
require 'nokogiri'
html = <<EOT
<html>
<head></head>
<body>
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
pre_blocks = doc.search('pre')
puts pre_blocks.map(&:to_html)
Which will output:
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
You can capture all text up to the first closing pre tag by modifying your regular expression to:
/^(.*?<\/pre>{1}).*$/m
This way you can get the matched text with:
text.match(regex)[1]
which will return only the text up to the first closing pre tag.
Reluctant matching might help in your case:
/^(.*?<\/pre>).*$/m
But it's probably not the best way to do this; consider using an HTML parser such as Nokogiri.
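The difference between the greedy pattern from the question and the reluctant one suggested in the answers can be seen on a small input. The sample string below is made up for illustration:

```ruby
# Made-up sample post with two <pre> blocks.
text = "intro <pre>first</pre> middle <pre>second</pre> rest"

greedy    = %r{^(.*</pre>).*$}m   # .* grabs as much as possible
reluctant = %r{^(.*?</pre>).*$}m  # .*? stops at the first </pre>

greedy_match    = text.match(greedy)[1]
reluctant_match = text.match(reluctant)[1]

puts greedy_match     # => intro <pre>first</pre> middle <pre>second</pre>
puts reluctant_match  # => intro <pre>first</pre>
```

The greedy version backtracks from the end of the string, so it stops at the last </pre>; the reluctant version expands from the start and stops at the first one.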

how to get xpath of text between <br> or <br />?

</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Assuming the above, how is it possible to use XPath to grab each fruit? It must use XPath of some sort.
Should I use substring-after(following-sibling...)?
EDIT: I am using Nokogiri parser.
Well, you could use "//br/text()", but that will return all the text-nodes inside the <br> tags. But since the above isn't well-formed xml I'm not sure how you are going to use xpath on it. Regex is usually a poor choice for html, but there are html (not xhtml) parsers available. I hesitate to suggest one for ruby simply because that isn't "my area" and I'd just be googling...
Try the following, which gets all text siblings of <br> tags as an array of strings with leading and trailing whitespace stripped:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(DATA)
# Collect every text node that sits before or after a <br>, then trim each one.
fruits = doc.xpath('//br/following-sibling::text() | //br/preceding-sibling::text()').map do |fruit|
  fruit.to_s.strip
end
puts fruits
__END__
</div>
apple
<br>
banana
<br/>
watermelon
<br>
orange
Is this what you want?
There are several issues here:
XPath works on XML - you have HTML which is not XML (basically, the tags don't match so an XML parser will throw an exception when you give it that text)
XPath normally also works by finding the attributes inside tags. Seeing as your <br> tags don't actually contain the text, they're just in-between it, this will also prove difficult
Because of this, what you probably want to do is use XPath (or similar) to get the contents of the div, and then split the string based on <br> occurrences.
As you've tagged this question with ruby, I'd suggest looking into hpricot, as it's a really nice and fast HTML (and XML) parsing library, which should be much more useful than mucking around with XPath
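The split-on-<br> idea mentioned above can be sketched in plain Ruby. The inner_html string here is an assumption: it stands in for the markup of the containing element, already extracted with whichever parser you use.

```ruby
# Assumption: this string stands in for the inner HTML of the containing element.
inner_html = "\napple\n<br>\nbanana\n<br/>\nwatermelon\n<br>\norange\n"

# Split on <br>, <br/> or <br />, then trim each piece and drop empty ones.
fruits = inner_html.split(%r{<br\s*/?>}i).map(&:strip).reject(&:empty?)

p fruits  # => ["apple", "banana", "watermelon", "orange"]
```

The regular expression is only used to split on one very simple tag, so it avoids most of the problems of matching general HTML with regexes.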

IE8 & FF XHTML error or badly formed span?

I recently have found a strange occurrence in IE8 & FF.
The designers were using js to dynamically create some span tags for layout (they were placing rounded corner graphics on some tabs). Now the xhtml, in js, looked like this: <span class="leftcorner" /><span class="rightcorner" /> and worked perfectly!
As we all know dynamically rendering elements in js can be quite processor intensive so I moved the elements from js into the page source, exactly as above.
... and it didn’t work... not only didn’t it work, it crashed IE8. The fix was simple: put the closing span in, i.e. <span class="leftcorner"></span>
I am a bit confused by this.
Firstly, as far as I am aware, <span class="leftcorner" /> is perfectly valid XHTML!
Secondly it works dynamically, but not in XHTML?!?!?
Can anyone shed any light on this or is it simply another odd occurrence of browsers?
The major browsers only support a small subset of self-closing tags. (See this answer for a complete list.)
Depending on how you were creating the elements in JS, the JavaScript engine probably created a valid element to place in the DOM.
I had a similar problem with <a> tags in IE.
The problem was that my links looked like this (the icon was set with CSS, so I didn't need any text in them):
<a href="link" class="icon edit" />
Unfortunately, in IE these links were not displayed at all. They have to be in <a></a> format (leaving the text empty didn't work either, so I put &nbsp; there). So I added a few extra lines of JS to fix it, as I didn't want to change all my HTML just for this browser (P.S. I'm using jQuery for my JS).
if ($.browser.msie) {
  // Give each empty icon link a non-breaking space so IE renders it
  $('a.icon').html('&nbsp;');
}
IE in particular does not support XHTML. That is, it will never apply proper XML parsing rules to a document - it will treat it as HTML even with proper DOCTYPE and all. XHTML is not always valid SGML, however. In some cases (such as <br/>) IE can figure it out because it's prepared to parse tagsoup, and not just valid SGML. However, in other cases, the same "tagsoup" behavior means that it won't treat /> as self-closing tag terminator.
In general, my advice is to just use HTML 4.01 Strict. That way you know exactly what to expect. And there's little point in feeding XHTML to browsers when they're treating it as HTML anyway...
I think that one of the answers to "Is writing self closing tags for elements not traditionally empty bad practice?" will answer your question.
XHTML is only XHTML if it is served as application/xhtml+xml — otherwise, at least as far as browsers are concerned, it is HTML and treated as tag soup.
As a result, <span /> means "A span start tag" and not "A complete span element". (Technically it should mean "A span start tag and a greater than sign", but that is another story).
The XHTML spec tells you what you need to do to get your XHTML to parse as HTML.
One of the rules is "For non-empty elements, end tags are required". The list of elements includes a quick reference to which are empty and which are not.
