Getting page title with Ruby - ruby

I am trying to get what's inside the title tag, but I can't get it to work. I am following some of the answers on Stack Overflow that are supposed to work, but they don't work for me.
This is what I am doing:
require "open-uri"
require "uri"
def browse startpage, depth, block
if depth > 0
begin
open(startpage){ |f|
block.call startpage, f
}
rescue
return
end
end
end
browse("https://www.ruby-lang.org/es/", 2, lambda { |page_name, web|
puts "Header information:"
puts "Title: #{web.to_s.scan(/<title>(.*?)<\/title>/)}"
puts "Base URI: #{web.base_uri}"
puts "Content Type: #{web.content_type}"
puts "Charset: #{web.charset}"
puts "-----------------------------"
})
The title output is just [], why?

open returns a File object or passes it to the block (actually a Tempfile but that doesn't matter). Calling to_s just returns a string containing the object's class and its id:
open('https://www.ruby-lang.org/es/') do |f|
f.to_s
end
#=> "#<File:0x007ff8e23bfb68>"
Scanning that string for a title is obviously useless:
"#<File:0x007ff8e23bfb68>".scan(/<title>(.*?)<\/title>/)
#=> []
Instead, you have to read the file's content:
open('https://www.ruby-lang.org/es/') do |f|
f.read
end
#=> "<!DOCTYPE html>\n<html>\n...</html>\n"
You can now scan the content for a <title> tag:
open('https://www.ruby-lang.org/es/') do |f|
str = f.read
str.scan(/<title>(.*?)<\/title>/)
end
#=> [["Lenguaje de Programaci\xC3\xB3n Ruby"]]
Or, using Nokogiri (because you can't parse [X]HTML with regex):
require 'nokogiri'
open('https://www.ruby-lang.org/es/') do |f|
doc = Nokogiri::HTML(f)
doc.at_css('title').text
end
#=> "Lenguaje de Programación Ruby"
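The difference between to_s and read can be checked without a network connection. The sketch below uses a StringIO as a stand-in for the object that open yields; that substitution and the sample HTML are assumptions made only to keep the example self-contained.

```ruby
require "stringio"

# Made-up page content; a real page would come from open-uri.
html = "<!DOCTYPE html>\n<html><head><title>Ruby Programming Language</title></head></html>\n"
io = StringIO.new(html)

# to_s describes the object itself, not what it contains.
object_string = io.to_s
no_match = object_string.scan(/<title>(.*?)<\/title>/)

# read returns the content, which is what the regex needs.
content = io.read
# scan returns one inner array per match, with one entry per capture group.
matches = content.scan(/<title>(.*?)<\/title>/)
# String#[] with a capture index returns the captured string, or nil if nothing matched.
title = content[/<title>(.*?)<\/title>/m, 1]

puts "to_s:  #{object_string}"
puts "scan:  #{matches.inspect}"
puts "title: #{title}"
```

Using String#[] with a capture index avoids the nested array that scan produces when only the first title is wanted.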

If you insist on using only open-uri, this one-liner will get you the page title:
2.1.4 :008 > puts open('https://www.ruby-lang.org/es/').read.scan(/<title>(.*?)<\/title>/)
Lenguaje de Programación Ruby
=> nil
If you need anything more complicated than this, please use Nokogiri or Mechanize.

Related

How do I scrape a website and output data to xml file with Nokogiri?

I've been trying to scrape data using Nokogiri and HTTParty and can scrape data off a website successfully and print it to the console but I can't work out how to output the data to an xml file in the repo.
Right now the code looks like this:
class Scraper
attr_accessor :parse_page
def initialize
doc = HTTParty.get("https://store.nike.com/gb/en_gb/pw/mens-nikeid-lifestyle-shoes/1k9Z7puZoneZoi3?ref=https%253A%252F%252Fwww.google.com%252F")
@parse_page ||= Nokogiri::HTML(doc)
end
def get_names
item_container.css(".product-display-name").css("p").children.map { |name| name.text }.compact
end
def get_prices
item_container.css(".product-price").css("span.local").children.map { |price| price.text }.compact
end
private
def item_container
parse_page.css(".grid-item-info")
end
scraper = Scraper.new
names = scraper.get_names
prices = scraper.get_prices
(0...prices.size).each do |index|
puts " - - - Index #{index + 1} - - -"
puts "Name: #{names[index]} | Price: #{prices[index]}"
end
end
I've tried changing the .each method to include a File.write() but all it ever does is write the last line of the output into the xml file. I would appreciate any insight as to how to parse the data correctly, I am new to scraping.
Is the File.write call inside the each loop? If so, you are probably overwriting the file on every iteration, which is why you only see the last line.
Try putting the each loop inside the block passed to File.open instead:
File.open(yourfile, 'w') do |file|
(0...prices.size).each do |index|
file.write("your text")
end
end
I also recommend reading about Nokogiri::XML::Builder and then saving its output to the file.
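The overwrite behaviour described above can be reproduced with the standard library alone. In this sketch the names and prices arrays are hypothetical stand-ins for the scraper's results, and the files live in a temporary directory.

```ruby
require "tmpdir"

# Hypothetical scraped data standing in for get_names / get_prices.
names  = ["Shoe A", "Shoe B", "Shoe C"]
prices = ["100.00", "120.00", "95.00"]

overwritten_lines, complete_lines = Dir.mktmpdir do |dir|
  overwritten = File.join(dir, "overwritten.txt")
  complete    = File.join(dir, "complete.txt")

  # File.write truncates the file on every call, so only the last row survives.
  names.zip(prices).each do |name, price|
    File.write(overwritten, "#{name}: #{price}\n")
  end

  # Opening the file once and writing inside the block keeps every row.
  File.open(complete, "w") do |file|
    names.zip(prices).each do |name, price|
      file.puts "#{name}: #{price}"
    end
  end

  [File.readlines(overwritten), File.readlines(complete)]
end

puts "File.write in the loop: #{overwritten_lines.size} line(s)"
puts "File.open around the loop: #{complete_lines.size} line(s)"
```

The same structure applies when the rows are XML fragments rather than plain text: open the output once, then write each item inside the block.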

Finding all links from ten URLs while reading a file

How can I extract the href attribute of every <a> tag on a page while reading the target URLs from a file?
If I have a text file that contains the target URLs:
http://mypage.com/1.html
http://mypage.com/2.html
http://mypage.com/3.html
http://mypage.com/4.html
Here's the code I have:
File.open("myfile.txt", "r") do |f|
f.each_line do |line|
# set the page_url to the current line
page = Nokogiri::HTML(open(line))
links = page.css("a")
puts links[0]["href"]
end
end
I'd flip it around. I would first parse the text file and load each line into memory (assuming it's a small enough data set). Then create one instance of Nokogiri for your HTML doc and extract all the href attributes (like you are doing).
Something like this untested code:
links = []
hrefs = []
File.open("myfile.txt", "r") do |f|
f.each_line do |line|
links << line.chomp
end
end
page = Nokogiri::HTML(html)
page.css("a").each do |tag|
hrefs << tag['href']
end
links.each do |link|
if hrefs.include?(link)
puts "its here"
end
end
If all I wanted to do was output the 'href' for each <a>, I'd write something like:
File.foreach('myfile.txt') do |url|
page = Nokogiri::HTML(open(url.chomp))
puts page.search('a').map{ |link| link['href'] }
end
Of course, <a> tags don't have to have an 'href' attribute, but puts won't care: it prints a blank line for a nil entry.
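Two details are easy to miss when reading URLs from a file: each line still carries its line terminator, and links without an href come back as nil. A minimal sketch of handling both, using a temporary file and a hard-coded array as hypothetical stand-ins for the real URL list and the Nokogiri result:

```ruby
require "tempfile"

# Hypothetical URL list, mirroring the text file in the question.
list = Tempfile.new("urls")
list.write("http://mypage.com/1.html\n\nhttp://mypage.com/2.html\n")
list.close

# chomp: true strips the line terminator; blank lines are skipped.
urls = File.foreach(list.path, chomp: true).reject(&:empty?)
list.unlink

# Stand-in for page.search('a').map { |link| link['href'] }:
# an <a> without an href yields nil, which compact removes.
hrefs = ["/about", nil, "/contact"].compact

puts urls
puts hrefs
```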

Nokogiri and XPath: saving text result of scrape

I would like to save the text results of a scrape in a file. This is my current code:
require "rubygems"
require "open-uri"
require "nokogiri"
class Scrapper
attr_accessor :html, :single
def initialize(url)
download = open(url)
@page = Nokogiri::HTML(download)
@html = @page.xpath('//div[@class = "quoteText" and following-sibling::div[1][@class = "quoteFooter" and .//a[@href and normalize-space() = "hard-work"]]]')
end
def get_quotes
@quotes_array = @html.collect {|node| node.text.strip}
@single = @quotes_array.each do |quote|
quote.gsub(/\s{2,}/, " ")
end
end
end
I know that I can write a file like this:
File.open('text.txt', 'w') do |fo|
fo.write(content)
end
but I don't know how to incorporate @single, which holds the results of my scrape. The ultimate goal is to insert the information into a database.
I have come across some folks using YAML, but I am finding it hard to follow the step-by-step guide.
Can anyone point me in the right direction?
Thank you.
Just use:
@single = @quotes_array.map do |quote|
quote.squeeze(' ')
end
File.open('text.txt', 'w') do |fo|
fo.puts @single
end
Or:
File.open('text.txt', 'w') do |fo|
fo.puts @quotes_array.map{ |q| q.squeeze(' ') }
end
and don't bother creating @single.
Or:
File.open('text.txt', 'w') do |fo|
fo.puts @html.collect { |node| node.text.strip.squeeze(' ') }
end
and don't bother creating @single or @quotes_array.
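The switch from each to map matters because each returns its receiver and discards the block's results, while gsub and squeeze return new strings rather than modifying the original. A small sketch with made-up quotes (the strings are assumptions for illustration) also shows where squeeze(' ') and the question's gsub differ:

```ruby
# Made-up quotes with messy whitespace, standing in for scraped text.
quotes = ["  Hard   work  beats talent. ", "Stay\n\n  hungry."]

# each returns the original array; the gsub results are thrown away.
each_result = quotes.each { |q| q.gsub(/\s{2,}/, " ") }

# map collects the block's return values into a new array.
map_squeeze = quotes.map { |q| q.strip.squeeze(" ") }          # collapses runs of spaces only
map_gsub    = quotes.map { |q| q.strip.gsub(/\s{2,}/, " ") }   # collapses any whitespace run

p each_result.equal?(quotes)
p map_squeeze
p map_gsub
```

If the scraped text contains newlines or tabs that should also be collapsed, the gsub form is the closer match to the original intent.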
squeeze is part of the String class. This is from the documentation:
"  now   is  the".squeeze(" ") #=> " now is the"

Add a class to an element with Nokogiri

Apparently Nokogiri's add_class method only works on NodeSets, making this code invalid:
doc.search('a').each do |anchor|
anchor.inner_text = "hello!"
anchor.add_class("whatever") # WHOOPS!
end
What can I do to make this code work? I figured it'd be something like
doc.search('a').each do |anchor|
anchor.inner_text = "hello!"
Nokogiri::XML::NodeSet.new(anchor).add_class("whatever")
end
but this doesn't work either. Please tell me I don't have to implement my own add_class for single nodes!
A CSS class is just another attribute on an element:
doc.search('a').each do |anchor|
anchor.inner_text = "hello!"
anchor['class']="whatever"
end
Since CSS classes are space-delimited in the attribute, if you're not sure if one or more classes might already exist you'll need something like
anchor['class'] ||= ""
anchor['class'] = anchor['class'] << " whatever"
You need to explicitly set the attribute using = instead of just mutating the string returned for the attribute. This, for example, will not change the DOM:
anchor['class'] ||= ""
anchor['class'] << " whatever"
Even though it results in more work being done, I'd probably do this like so:
class Nokogiri::XML::Node
def add_css_class( *classes )
existing = (self['class'] || "").split(/\s+/)
self['class'] = existing.concat(classes).uniq.join(" ")
end
end
If you don't want to monkey-patch the class, you could alternatively extend individual nodes with a module:
module ClassMutator
def add_css_class( *classes )
existing = (self['class'] || "").split(/\s+/)
self['class'] = existing.concat(classes).uniq.join(" ")
end
end
anchor.extend ClassMutator
anchor.add_css_class "whatever"
Edit: You can see that this is basically what Nokogiri does internally for the add_class method you found by clicking on the class to view the source:
# File lib/nokogiri/xml/node_set.rb, line 136
def add_class name
each do |el|
next unless el.respond_to? :get_attribute
classes = el.get_attribute('class').to_s.split(" ")
el.set_attribute('class', classes.push(name).uniq.join(" "))
end
self
end
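The merging logic in these snippets can be tried on plain strings, without Nokogiri. The helper below is hypothetical (merge_css_classes is not a Nokogiri method) and is only a sketch of the same split, concat, uniq, join idea. It uses split with no argument, which ignores leading whitespace, whereas split(/\s+/) leaves an empty first element when the attribute starts with a space.

```ruby
# Hypothetical helper, not part of Nokogiri: merges class names on plain strings.
def merge_css_classes(existing, *classes)
  # to_s turns a missing attribute (nil) into "", and a bare split
  # breaks on runs of whitespace while dropping leading whitespace.
  (existing.to_s.split + classes).uniq.join(" ")
end

from_nil   = merge_css_classes(nil, "whatever")
from_messy = merge_css_classes(" btn  primary", "primary", "active")

# For comparison: a regex split keeps an empty string for leading whitespace.
regex_split = " btn".split(/\s+/)

puts from_nil
puts from_messy
p regex_split
```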
Nokogiri's add_class works on a NodeSet, as you found. Trying to add the class inside the each block wouldn't work, though, because at that point you are working on an individual node.
Instead:
require 'nokogiri'
html = '<p>one</p><p>two</p>'
doc = Nokogiri::HTML(html)
doc.search('p').tap{ |ns| ns.add_class('boo') }.each do |n|
puts n.text
end
puts doc.to_html
Which outputs:
# >> one
# >> two
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p class="boo">one</p>
# >> <p class="boo">two</p>
# >> </body></html>
The tap method, available in Ruby 1.9+, gives access to the NodeSet itself, allowing the add_class method to add the "boo" class to the <p> tags.
Old thread, but it's the top Google hit. You can now do this with the append_class method without having to mess with space-delimiters:
doc.search('a').each do |anchor|
anchor.inner_text = "hello!"
anchor.append_class('whatever')
end

Ruby parameterize if ... then blocks

I am parsing a text file and want to be able to extend the sets of tokens that can be recognized easily. Currently I have the following:
if line =~ /!DOCTYPE/
puts "token doctype " + line[0,20]
@ast[:doctype] << line
elsif line =~ /<html/
puts "token main HTML start " + line[0,20]
html_scanner_off = false
elsif line =~ /<head/ and not html_scanner_off
puts "token HTML header starts " + line[0,20]
html_header_scanner_on = true
elsif line =~ /<title/
puts "token HTML title " + line[0,20]
@ast[:HTML_header_title] << line
end
Is there a way to write this with a yield block, e.g. something like:
scanLine("title", :HTML_header_title, line)
?
Don't parse HTML with regexes.
That aside, there are several ways to do what you're talking about. One:
class Parser
class Token
attr_reader :name, :pattern, :block
def initialize(name, pattern, block)
@name = name
@pattern = pattern
@block = block
end
def process(line)
@block.call(self, line)
end
end
def initialize
@tokens = []
end
def scanLine(line)
@tokens.find {|t| line =~ t.pattern}.process(line)
end
def addToken(name, pattern, &block)
@tokens << Token.new(name, pattern, block)
end
end
p = Parser.new
p.addToken("title", /<title/) {|token, line| puts "token #{token.name}: #{line}"}
p.scanLine('<title>This is the title</title>')
This has some limitations (like not checking for duplicate tokens), but works:
$ ruby parser.rb
token title: <title>This is the title</title>
$
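One further limitation of the parser above: a line that matches no token makes find return nil, so calling process on the result raises NoMethodError. A possible variant that returns nil for unmatched lines is sketched below; the names TokenRule, LineScanner, add_rule and scan_line are made up for this illustration and do not come from the answer.

```ruby
# Hypothetical names throughout; a sketch of the same table-driven idea.
TokenRule = Struct.new(:name, :pattern, :handler)

class LineScanner
  def initialize
    @rules = []
  end

  # Registers a rule; the block receives the rule and the matching line.
  def add_rule(name, pattern, &handler)
    @rules << TokenRule.new(name, pattern, handler)
    self
  end

  # Runs the first matching rule's handler, or returns nil if nothing matches.
  def scan_line(line)
    rule = @rules.find { |r| line =~ r.pattern }
    return nil unless rule
    rule.handler.call(rule, line)
  end
end

seen = []
scanner = LineScanner.new
scanner.add_rule(:doctype, /!DOCTYPE/) { |rule, line| seen << [rule.name, line[0, 20]] }
scanner.add_rule(:title, /<title/)     { |rule, line| seen << [rule.name, line[0, 20]] }

scanner.scan_line("<title>This is the title</title>")
unmatched = scanner.scan_line("plain text with no token")

p seen
p unmatched
```

Rules are tried in registration order, which mirrors the order of the elsif branches in the question.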
If you're intending to parse HTML content, you might want to use one of the HTML parsers like nokogiri (http://nokogiri.org/) or Hpricot (http://hpricot.com/) which are really high-quality. A roll-your-own approach will probably take longer to perfect than figuring out how to use one of these parsers.
On the other hand, if you're dealing with something that's not quite HTML and can't be parsed that way, then you'll need to roll your own somehow. There are a few Ruby parser frameworks out there that may help, but for simple tasks where performance isn't a critical factor, you can get by with a pile of regexps like you have here.
