Some spaces are removed when scraping a string using Nokogiri - ruby

I am very new to Ruby and I am currently working on site-scraping using Nokogiri to practice. I would like to scrape the details from 'deals' from a random group-buying site. I have been able to successfully scrape a site but I am having problems in parsing the output. I tried the solutions suggested in here and also using regex. So far, I have failed.
I am trying to parse the following title/description from this page:
Frosty Frappes starting at P100 for P200 worth at Café Tavolo – up to 55% off
This is what I got:
FrostyFrappes starting at P100 for P200 worth at Caf Tavolo up to 55% off
Here are the snippets in my code:
require 'uri'
require 'nokogiri'
html = open(url)
doc = Nokogiri::HTML(html.read)
doc.encoding = "utf-8"
title = doc.at_xpath('/html/body/div/div[9]/div[2]/div/div/div/h1/a')
puts title.content.to_s.strip.gsub(/[^0-9a-z%&!\n\/(). ]/i, '')
Please do tell me if I missed something out. Thank you.

Your xpath is too rigid and your regex is removing chars you want to keep. Here's how I would do it:
title = doc.at('div#contentDealTitle h1 a').text.strip.gsub(/\s+/,' ')
That says take the text from the first a tag that comes after div#contentDealTitle and h1, strip it (remove leading and trailing spaces) and replace all sequences of 1 or more whitespace char with a single space.

Related

Extracting RSS link with Nokogiri

I am using Nokogiri to extract the RSS link from a webpage. However, since some websites have absolute paths and others relative on their HTML, I wanted to make it so that if the website has a relative path it will be made absolute.
Here is my code:
require 'nokogiri'
require 'simple-rss'
require 'open-uri'
ARGV.map! { |http| "http://#{http}"}
ARGV.each do |website|
doc = Nokogiri::HTML(open(website))
rss_path = doc.xpath("//link[#type=\"application/rss+xml\"]").map do |link|
if link['href'] =~ /^http:\/\/[a-z]*\..*\//i
puts link['href']
else
puts "#{website}#{link['href']}"
end
end
So if I was on command line, I would type something like
ruby rss.rb 8gramgorilla.com rubyweekly.com
The code works fine for rubyweekly.com which has a relative path for its RSS but 8gramgorilla.com has an absolute path and so I would want it to just be output immediately, not have http://8gramgorilla.com/http://8gramgorilla.com/feed as the output. Basically, what's going on is that the IF statement is being ignored and it goes right away to the else statement.
The if statement isn’t being ignored, it is evaluating to false. Your regexp is /^http:\/\/[a-z]*\..*\//i, so it is looking for http:// followed by any number of a-z (or a . since zero a-z will also match). But the website url is http://8gramgorilla.com, the first character is the digit 8, which isn’t in the range a-z.
The most direct fix to this would be to change your regex to include digits, perhaps something like /^http:\/\/[\da-z]*\..*\//i (where \d has been added).
You might be able to simplify the regex more, perhaps simply checking to see if the url matches http:// at the start would be enough.
A more robust solution would be to properly parse the url in question, perhaps using the Addressable gem or the URI module in Ruby’s standard lib.
There's no need for the if, just do:
require 'uri'
puts URI.join(website, link['href']).to_s
To detect the RSS feed for the New York Times http://www.nytimes.com:
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml">
I would use the following to extract the href value from application/rss+xml link tag:
require 'nokogiri'
require 'httparty'
url = 'http://www.nytimes.com'
resp = HTTParty.get(url)
doc = Nokogiri::HTML(resp.body)
feed = doc.css("link[type='application/rss+xml']").map{|link|link[:href]}.first
Which would return the RSS feed value for the site:
http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml
Note, if the site does not have application/rss+xml tag, the code will simply return nil.

Getting all links of a webpage using Ruby

I'm trying to retrieve every external link of a webpage using Ruby. I'm using String.scan with this regex:
/href="https?:[^"]*|href='https?:[^']*/i
Then, I can use gsub to remove the href part:
str.gsub(/href=['"]/)
This works fine, but I'm not sure if it's efficient in terms of performance. Is this OK to use or I should work with a more specific parser (nokogiri, for example)? Which way is better?
Thanks!
Using regular expressions is fine for a quick and dirty script, but Nokogiri is very simple to use:
require 'nokogiri'
require 'open-uri'
fail("Usage: extract_links URL [URL ...]") if ARGV.empty?
ARGV.each do |url|
doc = Nokogiri::HTML(open(url))
hrefs = doc.css("a").map do |link|
if (href = link.attr("href")) && !href.empty?
URI::join(url, href)
end
end.compact.uniq
STDOUT.puts(hrefs.join("\n"))
end
If you want just the method, refactor it a little bit to your needs:
def get_links(url)
Nokogiri::HTML(open(url).read).css("a").map do |link|
if (href = link.attr("href")) && href.match(/^https?:/)
href
end
end.compact
end
I'm a big fan of Nokogiri, but why reinvent the wheel?
Ruby's URI module already has the extract method to do this:
URI::extract(str[, schemes][,&blk])
From the docs:
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test#example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test#example.com"]
You could use Nokogiri to walk the DOM and pull all the tags that have URLs, or have it retrieve just the text and pass it to URI.extract, or just let URI.extract do it all.
And, why use a parser, such as Nokogiri, instead of regex patterns? Because HTML, and XML, can be formatted in a lot of different ways and still render correctly on the page or effectively transfer the data. Browsers are very forgiving when it comes to accepting bad markup. Regex patterns, on the other hand, work in very limited ranges of "acceptability", where that range is defined by how well you anticipate the variations in the markup, or, conversely, how well you anticipate the ways your pattern can go wrong when presented with unexpected patterns.
A parser doesn't work like a regex. It builds an internal representation of the document and then walks through that. It doesn't care how the file/markup is laid out, it does its work on the internal representation of the DOM. Nokogiri relaxes its parsing to handle HTML, because HTML is notorious for being poorly written. That helps us because with most non-validating HTML Nokogiri can fix it up. Occasionally I'll encounter something that is SO badly written that Nokogiri can't fix it correctly, so I'll have to give it a minor nudge by tweaking the HTML before I pass it to Nokogiri; I'll still use the parser though, rather than try to use patterns.
Mechanize uses Nokogiri under the hood but has built-in niceties for parsing HTML, including links:
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://example.com/')
page.links_with(:href => /^https?/).each do |link|
puts link.href
end
Using a parser is generally always better than using regular expressions for parsing HTML. This is an often-asked question here on Stack Overflow, with this being the most famous answer. Why is this the case? Because constructing a robust regular expression that can handle real-world variations of HTML, some valid some not, is very difficult and ultimately more complicated than a simple parsing solution that will work for just about all pages that will render in a browser.
why you dont use groups in your pattern?
e.g.
/http[s]?:\/\/(.+)/i
so the first group will already be the link you searched for.
Can you put groups in your regex? That would reduce your regular expressions to 1 instead of 2.

Nokogiri, open-uri, and Unicode Characters

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")
At this point, the title looks like this:
Rag\303\271
Instead of:
Ragù
How can I have nokogiri return the proper character (e.g. ù in this case)?
Here's an example URL:
http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037
Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read and pass the resulting string to Nokogiri.
Analysis:
If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and setting the doc encoding, it's no good:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1]
puts h52.text, h52.text.encoding
#=> Genealogà a de Jesucristo
#=> UTF-8
We can see that this is not the fault of open-uri:
html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
gene = html.read[/Gene\S+/]
puts gene, gene.encoding
#=> Genealogía
#=> UTF-8
This is a Nokogiri issue when dealing with open-uri, it seems. This can be worked around by passing the HTML as a raw string to Nokogiri:
# encoding: UTF-8
require 'nokogiri'
require 'open-uri'
html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
doc = Nokogiri::HTML(html.read)
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1].text
puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
#=> Genealogía de Jesucristo
#=> UTF-8
#=> true
I was having the same problem and the Iconv approach wasn't working. Nokogiri::HTML is an alias to Nokogiri::HTML.parse(thing, url, encoding, options).
So, you just need to do:
doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')
and it'll convert the page encoding properly to utf-8. You'll see Ragù instead of Rag\303\271.
When you say "looks like this," are you viewing this value IRB? It's going to escape non-ASCII range characters with C-style escaping of the byte sequences that represent the characters.
If you print them with puts, you'll get them back as you expect, presuming your shell console is using the same encoding as the string in question (Apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a handle should also result in UTF-8 sequences.
If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.
For 1.9: http://blog.grayproductions.net/articles/ruby_19s_string
for 1.8, you probably need to look at Iconv.
Also, if you need to interact with COM components in Windows, you'll need to tell ruby to use the correct encoding with something like the following:
require 'win32ole'
WIN32OLE.codepage = WIN32OLE::CP_UTF8
If you're interacting with mysql, you'll need to set the collation on the table to one that supports the encoding that you're working with. In general, it's best to set the collation to UTF-8, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.
Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave explanation of that to someone else.
Try setting the encoding option of Nokogiri, like so:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
doc.encoding = 'utf-8'
title = doc.at_css("title")
Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) fixed issues I was having with parsing certain special character, specifically em-dashes.
(The accented characters in your link came through fine in both, so don't know if this would help you with that.)
EXAMPLE:
url = 'https://www.youtube.com/watch?v=4r6gr7uytQA'
doc = Nokogiri::HTML(open(url))
doc.title
=> "Josh Waitzkin â\u0080\u0094 How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"
doc = Nokogiri::HTML5(open(url))
doc.title
=> "Josh Waitzkin — How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"
You need to convert the response from the website being scraped (here epicurious.com) into utf-8 encoding.
as per the html content from the page being scraped, its "ISO-8859-1" for now. So, you need to do something like this:
require 'iconv'
doc = Nokogiri::HTML(Iconv.conv('utf-8//IGNORE', 'ISO-8859-1', open(link).read))
Read more about it here: http://www.quarkruby.com/2009/9/22/rails-utf-8-and-html-screen-scraping
Just to add a cross-reference, this SO page gives some related information:
How to make Nokogiri transparently return un/encoded Html entities untouched?
Tip: you could also use the Scrapifier gem to get metadata, as the page title, from URIs in a very simple way. The data are all encoded in UTF-8.
Check it out: https://github.com/tiagopog/scrapifier
Hope it's useful for you.

How can I get Nokogiri to parse and return an XML document?

Here's a sample of some oddness:
#!/usr/bin/ruby
require 'rubygems'
require 'open-uri'
require 'nokogiri'
print "without read: ", Nokogiri(open('http://weblog.rubyonrails.org/')).class, "\n"
print "with read: ", Nokogiri(open('http://weblog.rubyonrails.org/').read).class, "\n"
Running this returns:
without read: Nokogiri::XML::Document
with read: Nokogiri::HTML::Document
Without the read returns XML, and with it is HTML? The web page is defined as "XHTML transitional", so at first I thought Nokogiri must have been reading OpenURI's "content-type" from the stream, but that returns 'text/html':
(rdb:1) doc = open(('http://weblog.rubyonrails.org/'))
(rdb:1) doc.content_type
"text/html"
which is what the server is returning. So, now I'm trying to figure out why Nokogiri is returning two different values. It doesn't appear to be parsing the text and using heuristics to determine whether the content is HTML or XML.
The same thing is happening with the ATOM feed pointed to by that page:
(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document
(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails').read)
(rdb:1) doc.class
Nokogiri::HTML::Document
I need to be able to parse a page without knowing what it is in advance, either HTML or a feed (RSS or ATOM) and reliably determine which it is. I asked Nokogiri to parse the body of either a HTML or XML feed file, but I'm seeing those inconsistent results.
I thought I could write some tests to determine the type but then I ran into xpaths not finding elements, but regular searches working:
(rdb:1) doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
(rdb:1) doc.class
Nokogiri::XML::Document
(rdb:1) doc.xpath('/feed/entry').length
0
(rdb:1) doc.search('feed entry').length
15
I figured xpaths would work with XML but the results don't look trustworthy either.
These tests were all done on my Ubuntu box, but I've seen the same behavior on my Macbook Pro. I'd love to find out I'm doing something wrong, but I haven't seen an example for parsing and searching that gave me consistent results. Can anyone show me the error of my ways?
It has to do with the way Nokogiri's parse method works. Here's the source:
# File lib/nokogiri.rb, line 55
def parse string, url = nil, encoding = nil, options = nil
doc =
if string =~ /^\s*<[^Hh>]*html/i # Probably html
Nokogiri::HTML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_HTML)
else
Nokogiri::XML::Document.parse(string, url, encoding, options || XML::ParseOptions::DEFAULT_XML)
end
yield doc if block_given?
doc
end
The key is the line if string =~ /^\s*<[^Hh>]*html/i # Probably html. When you just use open, it returns an object that doesn't work with regex, thus it always returns false. On the other hand, read returns a string, so it could be regarded as HTML. In this case it is, because it matches that regex. Here's the start of that string:
<!DOCTYPE html PUBLIC
The regex matches the "!DOCTYPE " to [^Hh>]* and then matches the "html", thus assuming it's HTML. Why someone selected this regex to determine if the file is HTML is beyond me. With this regex, a file that begins with a tag like <definitely-not-html> is considered HTML, but <this-is-still-not-html> is considered XML. You're probably best off staying away from this dumb function and invoking Nokogiri::HTML::Document#parse or Nokogiri::XML::Document#parse directly.
Responding to this part of your question:
I thought I could write some tests to
determine the type but then I ran into
xpaths not finding elements, but
regular searches working:
I've just come across this problem using Nokogiri to parse an Atom feed. The problem seemed down to the anonymous name-space declaration:
<feed xmlns="http://www.w3.org/2005/Atom">
Removing the XMLNS declaration from the source XML would enable Nokogiri to search with XPath as usual. Removing that declaration from the feed obviously wasn't an option here, so instead I just removed the namespaces from the document after parsing:
doc = Nokogiri.parse(open('http://feeds.feedburner.com/RidingRails'))
doc.remove_namespaces!
doc.xpath('/feed/entry').length
Ugly I know, but it did the trick.

How do I extract links from HTML using regex?

I want to extract links from google.com; My HTML code looks like this:
<a href="http://www.test.com/" class="l"
I took me around five minutes to find a regex that works using www.rubular.com.
It is:
"(.*?)" class="l"
The code is:
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read()
links = source.scan(/"(.*?)" class="l"/)
links.each { |link| puts #{link}
}
The problem is, is it not outputting the websites links.
Those links actually have class=l not class="l". By the way, to figure this put I added some logging to the method so that you can see the output at various stages and debug it. I searched for the string you were expecting to find and didn't find it, which is why your regex failed. So I looked for the right string you actually wanted and changed the regex accordingly. Debugging skills are handy.
require "open-uri"
url = "http://www.google.com/search?q=ruby"
source = open(url).read
puts "--- PAGE SOURCE ---"
puts source
links = source.scan(/<a.+?href="(.+?)".+?class=l/)
puts "--- FOUND THIS MANY LINKS ---"
puts links.size
puts "--- PRINTING LINKS ---"
links.each do |link|
puts "- #{link}"
end
I also improved your regex. You are looking for some text that starts with the opening of an a tag (<a), then some characters of some sort that you dont care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attrubute (class=l).
I have .+? in three places there. the . means any character, the + means there must be one or more of the things right before it, and the ? means that the .+ should try to match as short a string as possible.
To put it bluntly, the problem is that you're using regexes. The problem is that HTML is what is known as a context-free language, while regular expressions can only the class of languages that are known as regular languages.
What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.
What im going wrong?
You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)

Resources