Ruby: Convert HTML/Redcloth to plain text - ruby

Does anybody know how I can convert html to plain text with Ruby. Well really I need to convert RedCloth to plain text, either way would be fine.
I'm not talking about just striping out the tags (that is all I've done so far). For example I would like an ordered list to retain the numbers, unordered lists to use an asterisk for bullets etc.
def red_cloth_to_plain_text(s)
s = RedCloth.new(s).to_html
s = strip_tags(s)
s = html_unescape(s) # reverse of html_escape
s = undo_red_cloths_html_codes(s)
return s
end
Maybe I have to attempt a RedCloth to plain text formatter

You need to make a new formatter class.
module RedCloth::Formatters
module PlainText
include RedCloth::Formatters::Base
# ...
end
end
I won't write your code for you today but this is very easy to do. Read the RedCloth source if you doubt me: it's only 346 lines for the HTML formatter.
So, once you have your PlainText formatter you patch the class and use it:
module RedCloth
class TextileDoc
def to_txt( *rules )
apply_rules(rules)
to(RedCloth::Formatters::PlainText)
end
end
end
print RedCloth.new(str).to_txt

Joseph Halter wrote a RedCloth plain formatter:
http://github.com/JosephHalter/redcloth-formatters-plain
Example usage:
RedCloth.new("p. this is *simple* _test_").to_plain
will return:
"this is simple test"

That may be what you have to do. You're not the first to want this, but I'm guessing it's not part of the library yet because everyone wants their plaintext a little different.

Related

writing a short script to process markdown links and handling multiple scans

I'd like to process just links written in markdown. I've looked at redcarpet which I'd be ok with using but I really want to support just links and it doesn't look like you can use it that way. So I think I'm going to write a little method using regex but....
assuming I have something like this:
str="here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
tmp=str.scan(/\[.*\]\(.*\)/)
or if there is some way I could just gsub in place [hope](http://www.github.com) -> <a href='http://www.github.com'>hope</a>
How would I get an array of the matched phrases? I was thinking once I get an array, I could just do a replace on the original string. Are there better / easier ways of achieving the same result?
I would actually stick with redcarpet. It includes a StripDown render class that will eliminate any markdown markup (essentially, rendering markdown as plain text). You can subclass it to reactivate the link method:
require 'redcarpet'
require 'redcarpet/render_strip'
module Redcarpet
module Render
class LinksOnly < StripDown
def link(link, title, content)
%{#{content}}
end
end
end
end
str="here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
md = Redcarpet::Markdown.new(Redcarpet::Render::LinksOnly)
puts md.render(str)
# => here is my thing hope and ...
This has the added benefits of being able to easily implement a few additional tags (say, if you decide you want paragraph tags to be inserted for line breaks).
You could just do a replace.
Match this:
\[([^[]\n]+)\]\(([^()[]\s"'<>]+)\)
Replace with:
\1
In Ruby it should be something like:
str.gsub(/\[([^[]\n]+)\]\(([^()[]\s"'<>]+)\)/, '\1')

How to build an Object from XML, modify, then write to File in Ruby

I'm having an awful time trying to use a library to parse an XML File into a hash like object, modify it, then print it back out to another XML file in Ruby. For a class I'm taking, we're supposed to use a Java JAXB like library where we convert XML into an object. We've already done SAX and DOM methods so we can't use those methods of XML de-serialization. Nokogiri helped me with both of these in Ruby.
The only problem is that besides the SIMPLE modifications I'm making to the objects, when I write to file it has drastic differences. Is there a Ruby library meant for doing just this? I've tried: ROXML, XML::Mapping, and ActiveSupport::CoreExt. The only one I can get to even run is ActiveSupport, and even then it starts putting element attributes as child elements in the output XML.
I'm willing to try out XmlSimple, but I'm curious has anyone actually had to do this before/run into the same problems? Again, I can't read in lines one at a time like SAX or build a Tree like structure like DOM, it needs to be a hash like object.
Any help is much appreciated!
You should have a look into nokogiri: http://nokogiri.org/
Then you can parse the XML like this :
xml_file = "some_path"
#xml = Nokogiri::XML(File.open xml_file)
#xml.xpath('//listing').each do |node|
style = node.search("style").text
end
With Xpath, you can perform queries in the XML :
#xml.xpath("//listing[name='John']").first(10)
OK, I got it working. After looking at ActiveSupport::CoreExt 's source code I found it just uses a gem called xml-simple. What's obnoxious is the gem, library name in the require statement, and class name are a mixture of hyphenated and non hyphenated spellings. For future reference here's what I did:
# gem install xml-simple
# ^ all lowercase, hyphenated
require 'xmlsimple'
# ^ all lowercase, not hyphenated
doc = XmlSimple.xml_in 'hw3.xml', 'KeepRoot' => true
# ^ Camel cased (it's a class), not hyphenated
# doc.class => Hash
# manipulate doc as a hash
file = File.new('HW3a.xml', 'w')
file.write("<?xml version='1.0' encoding='utf-8'?>\n")
file.write(XmlSimple.xml_out doc, 'KeepRoot' => true)
I hope this helps someone. Also make sure you pay attention to case and hyphens with this gem!!!

What would be the best way to take a string of html, chop it up, and put each piece into an array?

I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.
If I have a string of html such as this
some_html = '<div><b>This is some BOLD text</b></div>'
I want to to divide it into logical pieces, and then put those pieces into an array so I end with a result like this
html_array = ["<div>", "<b>", "This is some BOLD text", "</b>","</div>" ]
Rather than use regex I'd use the nokogiri gem (a gem for parsing html written by Aaron Patterson - contributor to Rails and Ruby). Here's a sample of how to use it:
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
You can then call html_doc.children to get a nodeset and work your way from there
html_doc.children # returns a nodeset
Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.
It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.
some_html.split(/(<[^>]*>)/).reject{|x| '' == x}

How to tidy up malformed xml in ruby

I'm having issues tidying up malformed XML code I'm getting back from the SEC's edgar database.
For some reason they have horribly formed xml. Tags that contain any sort of string aren't closed and it can actually contain other xml or html documents inside other tags. Normally I'd had this off to Tidy but that isn't being maintained.
I've tried using Nokogiri::XML::SAX::Parser but that seems to choke because the tags aren't closed. It seems to work alright until it hits the first ending tag and then it doesn't fire off on any more of them. But it is spiting out the right characters.
class Filing < Nokogiri::XML::SAX::Document
def start_element name, attrs = []
puts "starting: #{name}"
end
def characters str
puts "chars: #{str}"
end
def end_element name
puts "ending: #{name}"
end
end
It seems like this would be the best option because I can simply have it ignore the other xml or html doc. Also it would make the most sense because some of these documents can get quite large so storing the whole dom in memory would probably not work.
Here are some example files: 1 2 3
I'm starting to think I'll just have to write my own custom parser
Nokogiri's normal DOM mode is able to automatically fix-up the XML so it is syntactically correct, or a reasonable facsimile of that. It sometimes gets confused and will shift closing tags around, but you can preprocess the file to give it a nudge in the right direction if need be.
I saved the XML #1 out to a document and loaded it:
require 'nokogiri'
doc = ''
File.open('./test.xml') do |fi|
doc = Nokogiri::XML(fi)
end
puts doc.to_xml
After parsing, you can check the Nokogiri::XML::Document instance's errors method to see what errors were generated, for perverse pleasure.
doc.errors
If using Nokogiri's DOM model isn't good enough, have you considered using XMLLint to preprocess and clean the data, emitting clean XML so the SAX will work? Its --recover option might be of use.
xmllint --recover test.xml
It will output errors on stderr, and the code on stdout, so you can pipe it easily to another file.
As for writing your own parser... why? You have other options available to you, and reinventing a nicely implemented wheel is not a good use of time.

REXML is wrapping long lines. How do I switch that off?

I am creating an XML document using REXML
File.open(xmlFilename,'w') do |xmlFile|
xmlDoc = Document.new
# add stuff to the document...
xmlDoc.write(xmlFile,4)
end
Some of the elements contain quite a few arguments and hence, the according lines can get quite long. If they get longer than 166 chars, REXML inserts a line break. This is of course still perfectly valid XML, but my workflow includes some diffing and merging, which works best if each element is contained in one line.
So, is there a way to make REXML not insert these line-wrapping line breaks?
Edit: I ended up pushing the finished XML file through tidy as the last step of my script. If someone knew a nicer way to do this, I would still be grateful.
As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to fix it by overwriting the Formatters::Pretty class's write_text method so that it uses the configurable #width attribute instead of the hard-coded 80.
require "rubygems"
require "rexml/document"
include REXML
long_xml = "<root><tag>As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to *fix* it by overwriting the Formatters::Pretty class's write_text method.</tag></root>"
xml = Document.new(long_xml)
#fix bug in REXML::Formatters::Pretty
class MyPrecious < REXML::Formatters::Pretty
def write_text( node, output )
s = node.to_s()
s.gsub!(/\s/,' ')
s.squeeze!(" ")
#The Pretty formatter code mistakenly used 80 instead of the #width variable
#s = wrap(s, 80-#level)
s = wrap(s, #width-#level)
s = indent_text(s, #level, " ", true)
output << (' '*#level + s)
end
end
printer = MyPrecious.new(5)
printer.width = 1000
printer.compact = true
printer.write(xml, STDOUT)
Short answer: yes and no.
REXML uses different formatters based on the value you specify for indent. If you leave the default -1, it uses REXML::Formatters::Default. If you give it a value like 4, it uses REXML::Formatters::Pretty. The pretty formatter does have logic in it to wrap lines (though it looks like it wraps at 80, not 166), when dealing with text (not tags or attributes). For example, the contents of
<p> a paragraph tag </p>
would be wrapped at 80 characters, but
<a-tag with='a' long='list' of='attributes'/>
would not be wrapped.
Anyway the 80 is hard-coded in rexml/formatters/pretty.rb and not configurable. And if you use the default formatter with no indent, then it's mostly just a raw dump without added line breaks. You could try the transitive formatter (see docs for Document.write), but it's broken in some version of ruby and might require a code hack. It probably isn't what you want anyway.
You might try taking a look at Builder::XmlMarkup from the builder gem.

Resources