How to print XPath search? - ruby

I'm trying to parse an XML file with Nokogiri:
require 'nokogiri'
require 'open-uri'
@doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
@doc.xpath("//event[@league='*LOL*']")
print @doc.text
which works and prints all the events that contain "LOL" in the "league" attribute, but when I create a block, it runs but prints nothing:
@doc.xpath("//event[@league='*LOL*']").each do |league_element|
  puts "\n" + league_element.xpath('league').text
end

require 'nokogiri'
require 'open-uri'
@doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
events = @doc.xpath("//event[@league='*LOL*']")
puts @doc.children
.children "returns a new NodeSet containing all the children of all the nodes in the NodeSet." you can keep filtering node names and values using children.xpath()
For example:
@doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
events = @doc.xpath("//event[@league='*LOL*']")
puts @doc.children.xpath('//league').text
=> LOL Cham Kor
=> LOL Cham Kor
=> ....
Or
@doc.children.each do |item|
  puts item.xpath('//league')
end

Related

Nokogiri example not showing array (Ruby)

When I try to run this via the terminal I can parse and display the data, but when I check pets_array I am not seeing anything in it.
My code is as follows:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://newyork.craigslist.org/search/pet?s=0')
parse_page = Nokogiri::HTML(page)
pets_array = []
parse_page.css('.content').css('.row').css('.result-title hdrlnk').map do |a|
  post_name = a.text
  pets_array.push(post_name)
end
CSV.open('pets.csv', 'w') do |csv|
  csv << pets_array
end
Pry.start(binding)
To be more precise, you could access each anchor tag with the classes .result-title.hdrlnk inside .result-info, .result-row, .rows and .content:
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
p pets_array
# ["Mini pig", "Black Russian Terrier", "2 foster or forever homes needed Asap!", ...]
As you're using map, you can assign its result directly to the pets_array variable to store the text of each iterated element; there is no need to push.
If you want to write the data stored in the array to the CSV, you can pass it in directly; there is no need to first define it as an empty array (which is why you were getting a blank CSV file):
require 'httparty'
require 'nokogiri'
require 'csv'
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
CSV.open('pets.csv', 'w') { |csv| csv << pets_array }

How do I get this Nokogiri output to write each object to a column in a csv?

I have this code here which outputs a CSV, but when I open the CSV file it just has a 0 in the first two columns.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
page = Nokogiri::HTML(open("https://www.drugs.com/pharmaceutical-companies.html"))
puts page.class #=> Nokogiri::HTML::Document
pharma_links = page.css("div.col-list-az a")
link= pharma_links.each{|link| puts link['href'] }
company = pharma_links.each{|link| puts link.text}
CSV.open("/Users/file.csv", "wb") do |csv|
csv << [company, link]
end
The problem is that pharma_links.each{|link| ...} returns the ENTIRE NodeSet it was called on, not the values printed inside the block, so company and link both end up holding the same full collection. You would then have to re-map each company and link into a new array or hash (or pair them up by index, if you are lazy AND you know for certain nothing went wrong in either .each call).
To avoid this, simply build the CSV while you are looping through the data. Each line of the CSV corresponds to one pharma_links entry, so iterate through them at the same time:
require 'nokogiri'
require 'open-uri'
require 'csv'
page = Nokogiri::HTML(open("https://www.drugs.com/pharmaceutical-companies.html"))
# puts page.class #=> Nokogiri::HTML::Document
pharma_links = page.css("div.col-list-az a")
# Create the CSV and iterate through the links while creating it
# You can also add headers to the CSV on instantiation
CSV.open("file.csv", "wb", write_headers: true, headers: ['url','description']) do |csv|
pharma_links.each do |link|
puts "Adding #{link.text}" # prove that it works :)
csv << [link['href'], link.text]
end
end

Extract by <p> between h3 content nokogiri

I am trying to extract only the <p> elements that appear between Vigentes and Finalizados, but without success.
require 'nokogiri'
require 'open-uri'
require 'time'
#url = "http://www.caru.org.uy/web/servicios/llamados-a-concurso-publico-para-contratar-personal/"
page = Nokogiri::HTML(open(#url))
div_content = page.css('.contenido')
div_content.each do |item|
puts item.text
break if item.css('h3').text == "Finalizados"
end
You should be able to do:
css = 'h3:contains(Vigentes) ~ p:has(~ h3:contains(Finalizados))'
But unfortunately, nokogiri doesn't behave properly for this one so we'll use xpath:
xpath = "//h3[contains(text(), 'Vigentes')]/following-sibling::p[./following-sibling::h3[contains(text(), 'Finalizados')]]"
page.search(xpath).each do |p|
  # do something
end
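Putting it together, here is a minimal sketch of the whole flow, assuming the page keeps the structure described in the question (the relevant <p> elements sit between an h3 containing "Vigentes" and an h3 containing "Finalizados"); it simply prints the text of each matched paragraph:
require 'nokogiri'
require 'open-uri'

url = "http://www.caru.org.uy/web/servicios/llamados-a-concurso-publico-para-contratar-personal/"
page = Nokogiri::HTML(open(url))

# Every <p> that follows the "Vigentes" heading and still has a
# "Finalizados" heading somewhere after it, i.e. the paragraphs in between.
xpath = "//h3[contains(text(), 'Vigentes')]/following-sibling::p" \
        "[./following-sibling::h3[contains(text(), 'Finalizados')]]"

page.search(xpath).each do |p|
  puts p.text
end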

Ruby: How do I parse links with Nokogiri with content/text all the same?

What I am trying to do: parse links from a website (http://nytm.org/made-in-nyc) that all have the exact same text, "(hiring)", then write the list of links to a file, jobs.html. (If it is a violation to publish these websites I will quickly take down the direct URL; I thought it might be useful as a reference for what I am trying to do. First time posting on Stack.)
DOM Structure:
<article>
<ol>
<li>#waywire</li>
<li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a</li>
<li>Adafruit Industries</li>
<li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a</li>
etc...
What I have tried:
require 'nokogiri'
require 'open-uri'
def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select{|link| link.text == "(hiring)"}
  results = hire_links.each{|link| puts link['href']}
  begin
    file = File.open("./jobs.html", "w")
    file.write("#{results}")
  rescue IOError => e
  ensure
    file.close unless file == nil
  end
  puts hire_links
end
find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file but it is in XML format? Not sure how to target just the value and create a link from that. Not sure where to go from here. Thanks!
The problem is with how results is defined. results is an array of Nokogiri::XML::Element:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
The modified script:
require 'nokogiri'
require 'open-uri'
def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.map { |link| link['href'] }
  File.write('./jobs.html', results.join("\n"))
end
find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
Try using Mechanize. It leverages Nokogiri, and you can do something like
require 'mechanize'
browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /\(hiring\)/)
Then you will have an array of link objects that you can get whatever info you want. You can also use the link.click method that Mechanize provides.
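For instance, a rough sketch (assuming the same page and the jobs.html output from the earlier question) that collects the href of each matched link and writes them one per line:
require 'mechanize'

browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')

# links_with returns Mechanize::Page::Link objects; #href and #text
# expose the attribute value and the visible text.
links = page.links_with(text: /\(hiring\)/)
File.write('./jobs.html', links.map(&:href).join("\n"))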

nokogiri not recognising classes with hyphens

require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.priceangels.com/site-map.html"
doc = Nokogiri::HTML(open(url))
doc.css('.lav1').each do |item|
  puts item.text
end
doc.css('.masonry-brick').each do |item|
  puts item.text
end
This is my first time using nokogiri. The first each loop behaves as expected. The second each loop fails to find any matches.
Does Nokogiri not recognise class names with dashes (hyphens)?
How do I get nokogiri to find the '.masonry-brick' classes?
doc.css("ul.sitemap-item a").each do |me|
puts me.text
end
Is this what you were looking for?
also
<div class="hello world">
doc.css("div[#class='hello world']")
You can use that if you're having problems with spaces.
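A quick, self-contained way to check both the hyphenated class selector from the question and the attribute selector above (the HTML fragment here is made up just to illustrate):
require 'nokogiri'

doc = Nokogiri::HTML(<<~HTML)
  <div class="hello world">first</div>
  <div class="masonry-brick">second</div>
HTML

puts doc.css("div[class='hello world']").text  #=> first
puts doc.css('.masonry-brick').text            #=> second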
