I'm experimenting with Ruby and Nokogiri.
I've figured out how to open a local html document and select nodes by classname:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(open("file"))
puts doc.css('a.target')
How do I then dump the document without the nodes I've selected?
Should be:
doc.css('a.target').remove
puts doc.at('html').to_s
I am trying to verify the title of a website, and after some trial and error I have found that this can be done in Ruby with nokogiri and rest-client:
require 'nokogiri'
require 'rest-client'
page = Nokogiri::HTML(RestClient.get("http://#{user.username}.domain.com/"))
simian = page.at_css("title").text
if simian == "Welcome to"
puts "default monkey"
else
puts "website updated"
end
Unfortunately, for a large number of websites this doesn't always seem to work, as it returns:
RestClient::InternalServerError at /admin/users/list
500 Internal Server Error
I was wondering whether there is any way to achieve the same thing by simply using
mycurl = %x(curl http://........)
What would be an efficient way to parse the title from that without using any gem? Or can the curl output be used directly with nokogiri?
Thanks
After reading your question I wasn't sure whether you are set on those two gems, so here is another way that may prove simpler.
require 'open-uri'
url="http://google.com"
source = URI.open(url).read # Kernel#open no longer fetches URLs on Ruby 3.0+
source[/<title>(.*?)<\/title>/m, 1]
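If the title tag is upper-case or spread over several lines, a slightly more forgiving pattern may help. This is only a sketch: extract_title is a hypothetical helper name, and the sample string is invented. The HTML can come from anywhere, including the output of a curl call.

```ruby
# Hypothetical helper: pull the <title> text out of raw HTML with a regex.
def extract_title(html)
  # /i tolerates <TITLE>, /m lets the title span several lines,
  # and the lazy (.*?) stops at the first closing tag.
  match = html.match(/<title[^>]*>(.*?)<\/title>/im)
  match && match[1].strip
end

sample = "<html><head>\n<TITLE>\n  Welcome to\n</TITLE></head><body></body></html>"
puts extract_title(sample)                      # prints "Welcome to"
puts extract_title('<p>no title</p>').inspect   # prints "nil"
```

A regex like this is fine for a quick check, but it will not cope with every page, so a real parser is the safer choice when the markup is messy.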
There are two parts to this: fetching the page and parsing it. For fetching, you don't really need the rest-client gem when open-uri from the standard library will do. Nokogiri does the parsing, and it is not likely to be your problem. Try this:
require 'open-uri'
require 'nokogiri'
page = Nokogiri::HTML(URI.open('http://example.com/'))
puts page.at('title').text
Looking at this site here: http://www.grammy.com/nominees/search?artist=&title=&year=1958&genre=All
I can view all winners by year -- how can I scrape just the names of the winners on each page (for each year) and get them in a simple database?
thanks!
This will get you the actual names. Cleaning them up a little and inserting them into a DB is an exercise left to you:
require 'rubygems'
require 'hpricot'
require 'open-uri'
html = open("http://www.grammy.com/nominees/search?artist=&title=&year=1958&genre=All")
doc = Hpricot(html)
doc.search("td.views-field-field-nominee-extended-value").each do |winner|
puts winner.inner_html.strip
end
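For the cleanup step, something along these lines might be a starting point. It is a sketch only: clean_name is a hypothetical helper, and the sample cells are made up, so the real inner_html values may need different handling.

```ruby
# Hypothetical cleanup for the raw inner_html strings printed above.
def clean_name(raw)
  raw.gsub(/<[^>]+>/, ' ')   # drop any nested markup
     .gsub(/\s+/, ' ')       # collapse runs of whitespace
     .strip
end

# Made-up sample values; the real cells may look different.
raw_cells = [
  "  <a href=\"/artist/1\">Artist One</a>\n",
  "Artist\n   Two ",
  "Artist One"
]

# Clean every cell, drop blanks, and remove duplicates.
names = raw_cells.map { |cell| clean_name(cell) }.reject(&:empty?).uniq
puts names
```

Each entry in names could then become one row in whatever database you pick.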
Is there anything out there to convert html to plain text (maybe a nokogiri script)? Something that would keep the line breaks, but that's about it.
If I write something in Google Docs, like this, and run that command, it outputs (after removing the CSS and JavaScript) this:
\n\n\n\n\nh1. Test h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$!
So the formatting's all messed up. I'm sure someone has solved the details like these somewhere out there.
Actually, this is much simpler:
require 'rubygems'
require 'nokogiri'
puts Nokogiri::HTML(my_html).text
You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.
You could start with something like this:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
doc = Nokogiri::HTML(URI.open(uri))
doc.css('script, link').each { |node| node.remove }
puts doc.css('body').text.squeeze(" \n")
Is simply stripping tags and excess line breaks acceptable?
html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
First strips tags, second takes duplicate line breaks down to one, third removes line breaks at the start and end of the string.
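Here are the three steps applied one at a time to a small made-up string, so you can see what each gsub contributes. The variable names are only for illustration.

```ruby
# Invented sample input with tags and extra blank lines.
html = "\n<h1>Title</h1>\n\n\n<p>First <b>line</b></p>\n<p>Second line</p>\n"

no_tags   = html.gsub(/<\/?[^>]*>/, '')     # 1. strip tags
collapsed = no_tags.gsub(/\n\n+/, "\n")     # 2. runs of line breaks become one
trimmed   = collapsed.gsub(/^\n|\n$/, '')   # 3. drop leading/trailing break

puts trimmed.inspect   # prints "Title\nFirst line\nSecond line"
```

Note that a br tag is stripped like any other tag, so it does not leave a line break behind.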
require 'open-uri'
require 'nokogiri'
url = 'http://en.wikipedia.org/wiki/Wolfram_language'
doc = Nokogiri::HTML(URI.open(url))
text = ''
doc.css('p,h1').each do |e|
text << e.content
end
puts text
This extracts just the desired text from a webpage (most of the time). If, for example, you also wanted to include link text, add a to the CSS selector passed to doc.css (for example 'p,h1,a').
I'm using the sanitize gem.
(" " + Sanitize.clean(html).gsub("\n", "\n\n").strip).gsub(/^ /, "\t")
It does drop hyperlinks though, which may be an issue for some applications. But I'm doing NLP text analysis, so this is perfect for my needs.
If you are using Rails, you can do this:
html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
puts ActionView::Base.full_sanitizer.sanitize(html)
You want hpricot_scrub:
http://github.com/UnderpantsGnome/hpricot_scrub
You can specify which tags to strip / keep in a config hash.
If it's in Rails, you may use this:
html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe
Building slightly on Matchu's answer, this worked for my (very similar) requirements:
html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish
Hope it makes someone's life a bit easier :-)
How do I load a web page and search for a word in Ruby?
Here's a complete solution:
require 'open-uri'
if URI.open('http://example.com/').read =~ /searchword/
# do something
end
For something simple like this I would prefer to write a couple of lines of code instead of using a full-blown gem. Here is what I would do:
require 'net/http'
# let's take the url of this page
uri = 'http://stackoverflow.com/questions/1878891/how-to-load-a-web-page-and-search-for-a-word-in-ruby'
response = Net::HTTP.get_response(URI.parse(uri)) # => #<Net::HTTPOK 200 OK readbody=true>
# match the word Ruby
/Ruby/.match(response.body) # => #<MatchData "Ruby">
I would only go down the path of using a gem if I needed to do more than this and one of the gems already implemented the algorithm I needed.
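If the search word comes from user input or you only want whole-word hits, a small wrapper around the regex may be useful. This is a sketch under those assumptions: contains_word? is a hypothetical helper, and the body string stands in for response.body.

```ruby
# Hypothetical helper: true if word occurs as a whole word in body.
def contains_word?(body, word)
  # Regexp.escape keeps characters such as "+" or "." literal;
  # \b anchors the match at word boundaries, /i ignores case.
  pattern = /\b#{Regexp.escape(word)}\b/i
  !!(body =~ pattern)
end

# Invented page body standing in for a fetched response.
body = '<html><body><p>Learning ruby is fun.</p></body></html>'
puts contains_word?(body, 'Ruby')   # prints true
puts contains_word?(body, 'rub')    # prints false
```

Keep in mind this also matches words inside tag names and attributes, since it searches the raw HTML.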
I suggest using Nokogiri or hpricot to open and parse HTML documents. If you need something simple that doesn't require parsing the HTML, you can just use the open-uri library built into most Ruby distributions. If you need something more complex for posting forms (or logging in), you can use Mechanize.
Nokogiri is probably the preferred solution post _why, but both are about as simple as this:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://www.example.com"))
if doc.inner_text.match(/someword/)
puts "got it"
end
Both also allow you to search using xpath-like queries or CSS selectors, which allows you to grab items out of all divs with class=foo, for example.
Fortunately, it's not that big of a leap to move between open-uri, nokogiri and mechanize, so use the first one that meets your needs, and revise your code once you realize you need the capabilities of one of the other libraries.
You can also use the mechanize gem, something similar to this:
require 'rubygems'
require 'mechanize'
mech = Mechanize.new.get('http://example.com') do |page| # WWW::Mechanize was renamed to Mechanize in 1.0
if page.body =~ /mysearchregex/
puts "found it"
end
end