HTML to Plain Text with Ruby? - ruby

Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.
If I write something on Google Docs, like this, and run that command, it outputs (removing the CSS and JavaScript) this:
\n\n\n\n\nh1. Test h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$!
So the formatting's all messed up. I'm sure someone has solved the details like these somewhere out there.

Actually, this is much simpler:
require 'rubygems'
require 'nokogiri'
puts Nokogiri::HTML(my_html).text
You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.
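One rough way to keep the breaks without any extra gems is to turn break-producing tags into newlines before stripping the rest. This is a regex sketch, not a robust HTML parser, and the tag list is just illustrative:

```ruby
# Convert tags that imply a line break into real newlines, then strip
# the remaining tags. Crude, but keeps the visual line structure.
def html_to_text(html)
  html
    .gsub(/<br\s*\/?>/i, "\n")                # <br> and <br/> become newlines
    .gsub(%r{</(?:p|div|h[1-6]|li)>}i, "\n")  # closing block-level tags too
    .gsub(/<[^>]*>/, '')                      # drop whatever tags remain
    .gsub(/\n{3,}/, "\n\n")                   # collapse runs of blank lines
    .strip
end

puts html_to_text("<h1>Test</h1><p>first line<br>second line</p>")
# => Test
#    first line
#    second line
```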

You could start with something like this:
require 'open-uri'
require 'nokogiri'

uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
doc = Nokogiri::HTML(URI.open(uri))
doc.css('script, link').remove
puts doc.css('body').text.squeeze(" \n")

Is simply stripping tags and excess line breaks acceptable?
html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
First strips tags, second takes duplicate line breaks down to one, third removes line breaks at the start and end of the string.
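For example, applied to a small fragment:

```ruby
html = "<p>Hello</p>\n\n\n<p>world</p>\n"

# Strip tags, collapse duplicate line breaks, trim leading/trailing breaks
text = html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
text # => "Hello\nworld"
```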

require 'open-uri'
require 'nokogiri'
url = 'http://en.wikipedia.org/wiki/Wolfram_language'
doc = Nokogiri::HTML(URI.open(url))
text = ''
doc.css('p,h1').each do |e|
text << e.content
end
puts text
This extracts just the desired text from a webpage (most of the time). If, for example, you wanted to also include links, add the a tag to the CSS selector in the block.

I'm using the sanitize gem.
require 'sanitize'
(" " + Sanitize.clean(html).gsub("\n", "\n\n").strip).gsub(/^ /, "\t")
It does drop hyperlinks though, which may be an issue for some applications. But I'm doing NLP text analysis, so this is perfect for my needs.

If you are using Rails you can:
html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
puts ActionView::Base.full_sanitizer.sanitize(html)

You want hpricot_scrub:
http://github.com/UnderpantsGnome/hpricot_scrub
You can specify which tags to strip / keep in a config hash.

If it's in Rails, you may use this:
html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe

Building slightly on Matchu's answer, this worked for my (very similar) requirements:
html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish
Hope it makes someone's life a bit easier :-)

Related

writing a short script to process markdown links and handling multiple scans

I'd like to process just links written in markdown. I've looked at redcarpet, which I'd be OK with using, but I really only want to support links, and it doesn't look like you can use it that way. So I think I'm going to write a little method using a regex, but...
assuming I have something like this:
str="here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
tmp=str.scan(/\[.*\]\(.*\)/)
or if there is some way I could just gsub in place [hope](http://www.github.com) -> <a href='http://www.github.com'>hope</a>
How would I get an array of the matched phrases? I was thinking once I get an array, I could just do a replace on the original string. Are there better / easier ways of achieving the same result?
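One sketch of both steps with plain regex. Note the non-greedy .*? quantifiers: with the greedy .* in the question's scan, the match would swallow everything from the first [ to the last ) as a single match.

```ruby
str = "here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"

# With capture groups, scan returns an array of [text, url] pairs
pairs = str.scan(/\[(.*?)\]\((.*?)\)/)
# => [["hope", "http://www.github.com"], ["hxxx", "http://www.some.com"]]

# Or gsub in place, using the captures inside a block
linked = str.gsub(/\[(.*?)\]\((.*?)\)/) { %(<a href='#{$2}'>#{$1}</a>) }
# => "here is my thing <a href='http://www.github.com'>hope</a> and after <a href='http://www.some.com'>hxxx</a>"
```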
I would actually stick with redcarpet. It includes a StripDown render class that will eliminate any markdown markup (essentially, rendering markdown as plain text). You can subclass it to reactivate the link method:
require 'redcarpet'
require 'redcarpet/render_strip'
module Redcarpet
  module Render
    class LinksOnly < StripDown
      def link(link, title, content)
        %{<a href="#{link}">#{content}</a>}
      end
    end
  end
end

str = "here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
md = Redcarpet::Markdown.new(Redcarpet::Render::LinksOnly)
puts md.render(str)
# => here is my thing <a href="http://www.github.com">hope</a> and after <a href="http://www.some.com">hxxx</a>
This has the added benefits of being able to easily implement a few additional tags (say, if you decide you want paragraph tags to be inserted for line breaks).
You could just do a replace.
Match this:
\[([^\[\]\n]+)\]\(([^()\[\]\s"'<>]+)\)
Replace with:
<a href="\2">\1</a>
In Ruby it should be something like:
str.gsub(/\[([^\[\]\n]+)\]\(([^()\[\]\s"'<>]+)\)/, '<a href="\2">\1</a>')

nokogiri doc.xpath('head') returns nil

I am trying to get all the scripts declared in the head section of a given HTML page, but no matter how I try, it always returns nil.
doc = Nokogiri::HTML(open('http://www.walmart.com.br/'))
puts doc.at('body') # returns nil
doc.xpath('//html/head').each # this also will never iterate
Any suggestions?
The page's DOCTYPE isn't valid, so Nokogiri parses the page improperly. A quick, inefficient fix to the problem:
require 'nokogiri'
require 'open-uri'
require 'pp'
# Request the HTML before parsing
html = URI.open("http://www.walmart.com.br/").read
# Replace original DOCTYPE with a valid DOCTYPE
html = html.sub(/^<!DOCTYPE html(.*)$/, '<!DOCTYPE html>')
# Parse
doc = Nokogiri::HTML(html)
# Party.
pp doc.xpath("/html/head")
Ok, when I tried it in script/console, I could indeed get something useful for:
doc.at('body')
so I'm not sure what's going wrong there for you.
As for the HTML head: I can't get the head element either. html works fine, but head doesn't, either way.
I think there's something screwy with that walmart page. I tried doing the same thing for
Nokogiri::HTML(open('http://google.com/'))
and it worked just fine.
So unless you can figure out what they're doing to stop you from accessing parts of the page... then I don't know.
If you can deal with all scripts from the doc, I found that this one works just fine:
doc.xpath('//script')

Parsing SEC Edgar XML file using Ruby into Nokogiri

I'm having problems parsing the SEC Edgar files
Here is an example of this file.
The end result is that I want the content between <XML> and </XML> in a format I can access.
Here is my code so far that doesn't work:
scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)
Ok, there are a couple of things wrong:
sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, then trim off the trailing tags to keep the XML correct. So, you need to attack that problem first.
You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.
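Before that, it's worth seeing why the original match comes back nil: . does not cross newlines unless you add the /m flag. A standalone illustration with a stand-in string shaped like the filing (the real file is much longer but has the same structure):

```ruby
# A stand-in for the filing: header text, an <XML> island, trailer text
full = "garbage header\n<XML>\n<schemaVersion>X0603</schemaVersion>\n</XML>\ntrailer"

full.match(/<XML>(.*)<\/XML>/)              # => nil (. stops at newlines)
inner = full.match(/<XML>(.*)<\/XML>/m)[1]  # /m lets . span lines
# inner now holds the XML between the tags, ready to hand to Nokogiri::XML
```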
Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::XML(
  URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt').read.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, '')
)
puts doc.at('//schemaVersion').text
# >> X0603
I recommend practicing in IRB and reading the docs for Nokogiri
> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>]
that should get you going
Given this was asked a year back, the answer is probably OBE, but what the fellow should do is examine all of the documents that are on the site, and notice the actual filing details can be found at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm
Within this, you will see that the XML document he is after is already parsed out, ready for further manipulation, at:
http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml
Be warned, however, the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being 'primary_doc.xml'.

REXML is wrapping long lines. How do I switch that off?

I am creating an XML document using REXML
require 'rexml/document'
include REXML

File.open(xmlFilename, 'w') do |xmlFile|
  xmlDoc = Document.new
  # add stuff to the document...
  xmlDoc.write(xmlFile, 4)
end
Some of the elements contain quite a few arguments and hence, the according lines can get quite long. If they get longer than 166 chars, REXML inserts a line break. This is of course still perfectly valid XML, but my workflow includes some diffing and merging, which works best if each element is contained in one line.
So, is there a way to make REXML not insert these line-wrapping line breaks?
Edit: I ended up pushing the finished XML file through tidy as the last step of my script. If someone knew a nicer way to do this, I would still be grateful.
As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to fix it by overwriting the Formatters::Pretty class's write_text method so that it uses the configurable @width attribute instead of the hard-coded 80.
require "rubygems"
require "rexml/document"
include REXML
long_xml = "<root><tag>As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to *fix* it by overwriting the Formatters::Pretty class's write_text method.</tag></root>"
xml = Document.new(long_xml)
# Fix bug in REXML::Formatters::Pretty
class MyPrecious < REXML::Formatters::Pretty
  def write_text(node, output)
    s = node.to_s
    s.gsub!(/\s/, ' ')
    s.squeeze!(" ")
    # The Pretty formatter code mistakenly used 80 instead of the @width variable:
    # s = wrap(s, 80 - @level)
    s = wrap(s, @width - @level)
    s = indent_text(s, @level, " ", true)
    output << (' ' * @level + s)
  end
end

printer = MyPrecious.new(5)
printer.width = 1000
printer.compact = true
printer.write(xml, STDOUT)
Short answer: yes and no.
REXML uses different formatters based on the value you specify for indent. If you leave the default -1, it uses REXML::Formatters::Default. If you give it a value like 4, it uses REXML::Formatters::Pretty. The pretty formatter does have logic in it to wrap lines (though it looks like it wraps at 80, not 166), when dealing with text (not tags or attributes). For example, the contents of
<p> a paragraph tag </p>
would be wrapped at 80 characters, but
<a-tag with='a' long='list' of='attributes'/>
would not be wrapped.
Anyway, the 80 is hard-coded in rexml/formatters/pretty.rb and not configurable. And if you use the default formatter with no indent, it's mostly just a raw dump without added line breaks. You could try the transitive formatter (see the docs for Document.write), but it's broken in some versions of Ruby and might require a code hack. It probably isn't what you want anyway.
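On recent versions of the rexml gem, width is exposed as an accessor, so you can drive the Pretty formatter directly instead of subclassing. A sketch, assuming a recent rexml; behavior may differ on old releases where the 80 was still hard-coded:

```ruby
require 'rexml/document'
require 'rexml/formatters/pretty'

long_text = ("word " * 40).strip  # ~200 chars, well past the default wrap
doc = REXML::Document.new("<root><tag>#{long_text}</tag></root>")

formatter = REXML::Formatters::Pretty.new(4)  # indent by 4 spaces
formatter.compact = true     # keep text on the same line as its element
formatter.width = 10_000     # effectively disables line wrapping
out = String.new
formatter.write(doc, out)
# out now contains the <tag> text on a single (long) line
```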
You might try taking a look at Builder::XmlMarkup from the builder gem.

How to load a Web page and search for a word in Ruby

How do I load a Web page and search for a word in Ruby?
Here's a complete solution:
require 'open-uri'

if URI.open('http://example.com/').read =~ /searchword/
  # do something
end
For something simple like this I would prefer to write a couple of lines of code instead of using a full-blown gem. Here is what I would do:
require 'net/http'
# let's take the url of this page
uri = 'http://stackoverflow.com/questions/1878891/how-to-load-a-web-page-and-search-for-a-word-in-ruby'
response = Net::HTTP.get_response(URI.parse(uri)) # => #<Net::HTTPOK 200 OK readbody=true>
# match the word Ruby
/Ruby/.match(response.body) # => #<MatchData "Ruby">
I would go down the path of using a gem only if I needed to do more than this and had to implement some algorithm that is already done in one of the gems.
I suggest using Nokogiri or hpricot to open and parse HTML documents. If you need something simple that doesn't require parsing the HTML, you can just use the open-uri library built into most Ruby distributions. If you need something more complex for posting forms (or logging in), you can elect to use Mechanize.
Nokogiri is probably the preferred solution post-_why, but both are about as simple as this:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.example.com"))
if doc.inner_text.match(/someword/)
  puts "got it"
end
Both also allow you to search using xpath-like queries or CSS selectors, which allows you to grab items out of all divs with class=foo, for example.
Fortunately, it's not that big of a leap to move between open-uri, nokogiri and mechanize, so use the first one that meets your needs, and revise your code once you realize you need the capabilities of one of the other libraries.
You can also use the mechanize gem, something similar to this:
require 'mechanize'

page = Mechanize.new.get('http://example.com')
if page.body =~ /mysearchregex/
  puts "found it"
end
