Scanning a webpage for URLs with Ruby and regex

I'm trying to build an array of all links found at the URL below. Using page.scan(URI.regexp) or URI.extract(page) returns more than just URLs.
How do I get just the URLs?
require 'net/http'
require 'uri'
uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)
p page.scan(URI.regexp)
p URI.extract(page)

If you are just trying to extract links (<a href="..."> elements) from the text file, it is better to parse it as real HTML with Nokogiri and then extract the links like this:
require 'nokogiri'
require 'open-uri'
# Parse the raw HTML text
# (URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs)
doc = Nokogiri.parse(URI.open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))
# Extract all a-elements (HTML links)
all_links = doc.css('a')
# Sort + weed out duplicates and empty links
links = all_links.map { |link| link.attribute('href').to_s }.uniq.
sort.delete_if { |h| h.empty? }
# Print out some of them
puts links.grep(/store/)
http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...
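If a regex-based pass is still wanted, one option is to restrict extraction to the http and https schemes, since the unrestricted pattern also matches other scheme-like tokens such as mailto: addresses. This is a minimal sketch under assumptions: the sample string is made up, and it calls URI::RFC2396_Parser directly because URI.extract is marked obsolete in current Ruby.

```ruby
require 'uri'

# Hypothetical sample text; the real input would be the downloaded page body.
text = 'Visit http://store.steampowered.com/app/214590/ or ' \
       'https://example.com/page and write to mailto:someone@example.com'

# Passing a scheme list keeps only http/https matches and skips mailto: etc.
parser = URI::RFC2396_Parser.new
web_urls = parser.extract(text, %w[http https])

puts web_urls
```

This only narrows what the regex accepts. It still cannot tell a link in an href attribute from a URL that merely appears in the text, which is why parsing the HTML is the more reliable route.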

Related

Nokogiri : find all the anchors that match a name

I'm trying to save only the links to the sample pages on this website: MusicRadar
require 'open-uri'
require 'nokogiri'
link = 'https://www.musicradar.com/news/tech/free-music-samples-royalty-free-loops-hits-and-multis-to-download'
html = OpenURI.open_uri(link)
doc = Nokogiri::HTML(html)
#used grep because every sample link in that page ends with '-samples'
doc.xpath('//div/a/@href').grep(/-samples/)
The problem is that it only finds 3 of those links.
What am I doing wrong?
And what if I wanted to open each of those links?
CSS selectors are more convenient than XPath here (as long as the document structure is good enough for them).
Your XPath is roughly equivalent to the CSS selector div > a, which only matches links that are direct children of a div. That misses links nested deeper, for example the ones inside p elements.
If you need all links containing -samples, you can use the *= (substring match) attribute selector:
doc.css('a[href*="-samples"]') # return Nokogiri::XML::NodeSet with matched elements
doc.css('a[href*="-samples"]').map { |a| a[:href] } # return array of URLS

Using Nokogiri to scrape a value from Yahoo Finance?

I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo finance. The problem is that the ebit XPath returns nil. The way I got the XPath was using the Chrome developer tools and copy and pasting.
This is the page I'm trying to get the value from http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992 in the total current assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.
Nokogiri supports the :contains pseudo-class, so you can find the cell by its label:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a place-marker with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, and finally stripping the whitespace using gsub(/[^,\d]+/, ''), which removes everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.
This seems to work for me
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"

Web Scraping with Nokogiri::HTML and Ruby - save images

I'm working on a script to grab data and images from webshop product pages
(with approval from the owner).
I have a working script that loops through a CSV file with 20042 product URLs and writes the data I need to another CSV file. The final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread):
URL = 'http://www.sample.com/page.html'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
def make_absolute(href, root)
  URI.parse(root).merge(URI.parse(href)).to_s
end
Nokogiri::HTML(URI.open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
  uri = make_absolute(src, URL)
  File.open(File.basename(uri), 'wb') { |f| f.write(URI.open(uri).read) }
end
That runs great for a single URL, but I'm struggling to get it to loop through the URLs from the CSV file in my main script, which starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'
@prices = Array.new
@title = Array.new
@description = Array.new
@warranty = Array.new
@leadtime = Array.new
@urls = Array.new
@categories = Array.new
@subcategories = Array.new
@subsubcategories = Array.new
urls = CSV.read("lotofurls.csv")
(0..urls.length - 1).each do |index|
puts urls[index][0]
doc = Nokogiri::HTML(open(urls[index][0]))
Looks like all I need to figure out is how to feed the urls to the code saving the image but any help would be much appreciated!
You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc.).
For RMagick, you could do something like this:
require 'rmagick'
images.each do |image|
url = image.url # should be a string
Magick::Image.read(url).first.resize_to_fill(200,200).write(image.desired_filename)
end
That would write a 200x200px image for each url you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the railscast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/
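As for feeding the CSV URLs into the image-saving code, one way is to resolve each image href against the page it came from and derive the local filename from the result. The sketch below runs offline and rests on assumptions: the rows and the hrefs hash are hypothetical stand-ins for CSV.read("lotofurls.csv") and for the Nokogiri lookup of the zoom link, and URI.join is used in place of URI.parse(root).merge.

```ruby
require 'uri'

# Resolve a possibly relative image href against the page it came from.
def make_absolute(href, root)
  URI.join(root, href).to_s
end

# Hypothetical rows, shaped like the result of CSV.read:
# one array per line, with the product URL in the first column.
rows = [
  ['http://www.sample.com/shop/page1.html'],
  ['http://www.sample.com/shop/page2.html']
]

# Hypothetical stand-in for the href found on each page via Nokogiri.
hrefs = {
  'http://www.sample.com/shop/page1.html' => '/images/p1.jpg',
  'http://www.sample.com/shop/page2.html' => 'img/p2.jpg'
}

# Build [absolute image URL, local filename] pairs, one per product page.
targets = rows.map do |row|
  page_url = row[0]
  image_url = make_absolute(hrefs.fetch(page_url), page_url)
  [image_url, File.basename(URI.parse(image_url).path)]
end

targets.each { |url, filename| puts "#{url} -> #{filename}" }
```

In the real script, the download step from the snippet in the question would go inside the same loop, writing each image_url to its filename.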

Nokogiri multiple domains

Is it possible to do multi-domain searches using Nokogiri? I am aware you can do multiple XPath/CSS searches for a single domain/page, but what about multiple domains?
For example I want to scrape http://www.asus.com/Notebooks_Ultrabooks/S56CA/#specifications and http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications
My Code
require 'nokogiri'
require 'open-uri'
require 'spreadsheet'
doc = Nokogiri::HTML(open("http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications"))
#Grab our product specifications
data = doc.css('div#specifications div#spec-area ul.product-spec li')
#Modify our data
lines = data.map(&:text)
#Create the Spreadsheet
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
sheet1.name = 'My First Worksheet'
#Output our data to the Spreadsheet
lines.each.with_index do |line, i|
sheet1[i, 0] = line
end
book.write 'C:/Users/Barry/Desktop/output.xls'
Nokogiri has no concept of URLs; it only knows about a String or IO stream of XML or HTML. You're confusing OpenURI's purpose with Nokogiri's.
If you want to read from multiple sites, simply loop over the URLs, and pass the current URL to OpenURI to open the page:
%w[
http://www.asus.com/Notebooks_Ultrabooks/S56CA/#specifications
http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications
].each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  # do something with the document...
end
OpenURI will read the page, and pass its contents to Nokogiri for parsing. Nokogiri will still only see one page at a time, because that's all it is passed by OpenURI.

Ruby Regex Help

I want to extract the members' home page links from a site.
They look like this:
<a href="http://www.ptop.se" target="_blank">
I tested this regex on this site:
http://www.rubular.com/
<a href="(.*?)" target="_blank">
It should output http://www.ptop.se.
Here comes the code
require 'open-uri'
url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
open(url) { |page| content = page.read()
links = content.scan(/<a href="(.*?)" target="_blank">/)
links.each {|link| puts #{link}
}
}
If you run this, it doesn't work. Why not?
I would suggest that you use one of the good ruby HTML/XML parsing libraries e.g. Hpricot or Nokogiri.
If you need to log in on the site you might be interested in a library like WWW::Mechanize.
Code example:
require "open-uri"
require "hpricot"
require "nokogiri"
url = "http://itproffs.se/forumv2"
# Using Hpricot
doc = Hpricot(URI.open(url))
doc.search("//a[@target='_blank']").each { |user| puts "found #{user.inner_html}" }
# Using Nokogiri
doc = Nokogiri::HTML(URI.open(url))
doc.xpath("//a[@target='_blank']").each { |user| puts "found #{user.text}" }
Several issues with your code:
1. I don't know what you mean by using #{link} on its own. Outside a string, # starts a comment, so the link is never printed. If you want to interpolate the value, wrap it in quotes, i.e. "#{link}".
2. String#scan accepts a block. Use it to loop through the matches.
3. The page you are trying to access does not return any links that the regex would match anyway.
Here's something that would work:
require 'open-uri'
url = "http://itproffs.se/forumv2/"
URI.open(url) do |page|
  content = page.read
  content.scan(/<a href="(.*?)" target="_blank">/) do |match|
    match.each { |link| puts link }
  end
end
There're better ways to do it, I am sure. But this should work.
Hope it helps
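The scan behaviour can be checked without hitting the site. A minimal sketch, assuming a made-up HTML snippet in place of the forum page:

```ruby
# Hypothetical markup; only the first anchor carries target="_blank".
html = <<~HTML
  <a href="http://www.ptop.se" target="_blank">Home page</a>
  <a href="/forumv2/showprofile.aspx?memid=2683">Profile</a>
HTML

# With one capture group, scan returns an array of one-element arrays,
# so flatten turns it into a plain list of hrefs.
links = html.scan(/<a href="(.*?)" target="_blank">/).flatten

# Interpolation only happens inside a double-quoted string.
links.each { |link| puts "found #{link}" }
```

The pattern depends on the exact attribute order and spacing, so it silently misses anchors written any other way. That is the main argument for the parser-based answers above.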
