Nokogiri : find all the anchors that match a name - ruby

I'm trying to save only the links to the sample pages on this website: MusicRadar
require 'open-uri'
require 'nokogiri'
link = 'https://www.musicradar.com/news/tech/free-music-samples-royalty-free-loops-hits-and-multis-to-download'
html = OpenURI.open_uri(link)
doc = Nokogiri::HTML(html)
# used grep because every sample link on that page ends with '-samples'
doc.xpath('//div/a/@href').grep(/-samples/)
The problem is that it only finds 3 of those links.
What am I doing wrong?
And what if I wanted to open each of those links?

CSS selectors are often more convenient than XPath (if the document structure is good enough for that).
Your XPath //div/a behaves like the CSS selector div > a: it only matches anchors that are direct children of a div. That is too strict here, because some of the links are inside a p, so they are missed.
If you need all links containing -samples, you can use the *= attribute selector:
doc.css('a[href*="-samples"]') # returns a Nokogiri::XML::NodeSet with the matched elements
doc.css('a[href*="-samples"]').map { |a| a[:href] } # returns an array of URLs

Related

scanning a webpage for urls with ruby and regex

I'm trying to create an array of all links found at the below url. Using page.scan(URI.regexp) or URI.extract(page) returns more than just urls.
How do I get just the urls?
require 'net/http'
require 'uri'
uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)
p page.scan(URI.regexp)
p URI.extract(page)
If you are just trying to extract links (<a href="..."> elements) from the text file then it seems better to parse it as real HTML with Nokogiri, and then extract the links this way:
require 'nokogiri'
require 'open-uri'
# Parse the raw HTML text
doc = Nokogiri.parse(URI.open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))
# Extract all a-elements (HTML links)
all_links = doc.css('a')
# Sort + weed out duplicates and empty links
links = all_links.map { |link| link.attribute('href').to_s }
                 .uniq
                 .sort
                 .reject(&:empty?)
# Print out some of them
puts links.grep(/store/)
http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...
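If the text is not HTML and parsing it is not an option, restricting the schemes may already cut down the noise, since the unrestricted pattern also matches things like mailto: addresses. This is only a sketch: the sample text is made up, and URI.extract is marked obsolete in current Ruby versions, so it may print a warning in verbose mode.

```ruby
require 'uri'

# Hypothetical sample text; the real page content is not reproduced here.
text = 'Steam: http://store.steampowered.com/app/214590/ ' \
       'contact mailto:someone@example.com ratio 16:9'

# Passing a list of schemes keeps only http/https matches.
web_urls = URI.extract(text, %w[http https])

puts web_urls
```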

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is the example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.

Using Nokogiri to scrape a value from Yahoo Finance?

I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(URI.open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo finance. The problem is that the ebit XPath returns nil. The way I got the XPath was using the Chrome developer tools and copy and pasting.
This is the page I'm trying to get the value from http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992 in the total current assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.
Nokogiri supports:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a place-marker with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, then finally stripping the whitespace using gsub(/[^,\d]+/, ''), which removes everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.
This seems to work for me
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"

How do I get the tag name and CSS classes from Nokogiri::HTML

I've been trying to parse these HTML files with Nokogiri. This is the code I was using
require 'nokogiri'
doc = Nokogiri::HTML File.open('usc...html', 'r')
children = doc.css('body div')
children.each do |child|
puts child.name
end
That prints div for all of the child elements even though they are almost entirely p, h3 and h4 tags. Can someone explain why that is happening? Also, how do I get the CSS classes off of them?
This:
doc.css('body div')
will select every div inside body, not the elements those divs contain. If you want every element you should use:
doc.css('*')
You can get the CSS class of an element with child[:class].

Using Mechanize gem to return a collection of links based on their position in the DOM

I am struggling with mechanize. I wish to "click" on a set of links which can only be identified by their position (all links within div#content) or their href.
I have tried both of these identification methods above without success.
From the documentation, I could not figure out how to return a collection of links (for clicking) based on their position in the DOM, and not by attributes directly on the link.
Secondly, the documentation suggested you can use :href to match a partial href,
page = agent.get('http://foo.com/').links_with(:href => "/something")
but the only way I can get it to return a link is by passing a fully qualified URL, e.g.
page = agent.get('http://foo.com/').links_with(:href => "http://foo.com/something/a")
This is not very useful if I want to return a collection of links with hrefs such as
http://foo.com/something/a
http://foo.com/something/b
http://foo.com/something/c
etc...
Am I doing something wrong? Do I have unrealistic expectations?
Part II
The value you pass to :href has to be an exact match by default. So the string in your example would only match a link whose href is exactly "/something", and not one such as "/something/a".
What you want to do is to pass in a regex so that it will match a substring within the href field. Like so:
page = agent.get('http://foo.com/').links_with(:href => %r{/something/})
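The difference between the two kinds of criteria can be illustrated in plain Ruby, without Mechanize. This is only an analogy, under the assumption that the criteria are compared with something like case equality (===); the href list is the one from the question.

```ruby
# hrefs as they might appear on the page (taken from the question).
hrefs = %w[
  http://foo.com/something/a
  http://foo.com/something/b
  http://foo.com/other
]

# String === String is plain equality, so a partial path matches nothing.
exact = hrefs.select { |href| '/something' === href }

# Regexp === String succeeds when the pattern occurs anywhere in the string.
partial = hrefs.select { |href| %r{/something/} === href }

puts exact.inspect
puts partial.inspect
```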
edit:
Part I
In order to get it to select only the links inside a particular div, add a Nokogiri-style search method into your chain. Like this:
page = agent.get('http://foo.com/').search("div#content").links_with(:href => %r{/something/}) # **
Ok, that doesn't work because after you do page = agent.get('http://foo.com/').search("div#content") you get a Nokogiri object back instead of a mechanize one, so links_with won't work. However you will be able to extract the links from the Nokogiri object using the css method. I would suggest something like:
page = agent.get('http://foo.com/').search("div#content").css("a")
If that doesn't work, I'd suggest checking out http://nokogiri.org/tutorials
The nth link:
page.links[n-1]
The first 5 links:
page.links[0..4]
links with 'something' in the href:
page.links_with :href => /something/
You can get mechanize links using nokogiri nodes. See the source code of links() method.
# File lib/mechanize/page.rb, line 352
def links
  @links ||= %w{ a area }.map do |tag|
    search(tag).map do |node|
      Link.new(node, @mech, self)
    end
  end.flatten
end
So that means:
the_links = page.search("valid_selector").map do |node|
  Mechanize::Page::Link.new(node, agent, page)
end
This will give you the useful href, text and uri methods.

Resources