Nokogiri example not showing array (Ruby) - ruby

When I try to run this via terminal I can parse/display the data but when I type in pets_array = []
I am not seeing anything
My code is as follows:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://newyork.craigslist.org/search/pet?s=0')
parse_page = Nokogiri::HTML(page)
pets_array = []
parse_page.css('.content').css('.row').css('.result-title hdrlnk').map do |a|
post_name = a.text
pets_array.push(post_name)
end
CSV.open('pets.csv', 'w') do |csv|
csv << pets_array
end
Pry.start(binding)

Maybe to be precise you could access each anchor tag with class .result-title.hdrlnk inside .result-info, .result-row, .rows and .content:
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
p pets_array
# ["Mini pig", "Black Russian Terrier", "2 foster or forever homes needed Asap!", ...]
As you're using map, you can use the pets_array variable to store the text on each iterated element, no need to push.
If you want to write the data stored in the array, then you can push is directly, no need to redefined as an empty array (the reason because you get a blank csv file):
require 'httparty'
require 'nokogiri'
require 'csv'
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
CSV.open('pets.csv', 'w') { |csv| csv << pets_array }

Related

Web Crawling. Not able to crawl the data correctly

require 'open-uri'
require 'nokogiri'
url = "https://grofers.com/cn/grocery-staples/cid/16"
response = open(url).read
parsed_data = Nokogiri::HTML(response)
results = []
content = parsed_data.css('.section-right').css('.products products--grid').each do |row|
title = row.css('a.product__wrapper').text
puts title
results << content
end

How do I get this Nokogiri output to write each object to a column in a csv?

I have this code here which outputs a CSV, but when I open the CSV file its just has a 0 in the first two columns.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
page = Nokogiri::HTML(open("https://www.drugs.com/pharmaceutical-
companies.html"))
puts page.class #=> Nokogiri::HTML::Document
pharma_links = page.css("div.col-list-az a")
link= pharma_links.each{|link| puts link['href'] }
company = pharma_links.each{|link| puts link.text}
CSV.open("/Users/file.csv", "wb") do |csv|
csv << [company, link]
end
The problem is that pharma_links.each{|link| ...} returns the ENTIRE enumerator, so if you do this once for company and once for link you now have two new arrays. You then have to re-map each company & link in a new array / hash (or by index if you are lazy AND you know for certain nothing went wrong in the either .each call)
To avoid this, simply construct the CSV while you are looping through the data. For each line of the CSV you expect one pharma_links 'line', so iterate through each at the same time:
require 'nokogiri'
require 'open-uri'
require 'csv'
page = Nokogiri::HTML(open("https://www.drugs.com/pharmaceutical-companies.html"))
# puts page.class #=> Nokogiri::HTML::Document
pharma_links = page.css("div.col-list-az a")
# Create the CSV and iterate through the links while creating it
# You can also add headers to the CSV on instantiation
CSV.open("file.csv", "wb", write_headers: true, headers: ['url','description']) do |csv|
pharma_links.each do |link|
puts "Adding #{link.text}" # prove that it works :)
csv << [link['href'], link.text]
end
end

How to print XPath search?

I'm trying to parse an XML file with Nokogiri:
require 'nokogiri'
require 'open-uri'
#doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
#doc.xpath("//event[#league='*LOL*']")
print #doc.text
which works and prints all the events that contain "LOL" in the "league" attribute, but when I create a block, it runs but prints nothing:
#doc.xpath("//event[#league='*LOL*']").each do |league_element|
puts "\n"+league_element.xpath('league').text
end
require 'nokogiri'
require 'open-uri'
#doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
events = #doc.xpath("//event[#league='*LOL*']")
puts #doc.children
.children "returns a new NodeSet containing all the children of all the nodes in the NodeSet." you can keep filtering node names and values using children.xpath()
For example:
#doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
events = #doc.xpath("//event[#league='*LOL*']")
puts #doc.children.xpath('//league').text
=> LOL Cham Kor
=> LOL Cham Kor
=> ....
Or
#doc.children.each do |item|
puts item.xpath('//league')
end

How do I parse XML nodes from an API request?

How do I save the information from an XML page that I got from a API?
The URL is "http://api.url.com?number=8-6785503" and it returns:
<OperatorDataContract xmlns="http://psgi.pts.se/PTS_Number_Service" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>Tele2 Sverige AB</Name>
<Number>8-6785503</Number>
</OperatorDataContract>
How do I parse the Name and Number nodes to a file?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://api.url.com?number=8-6785503"
doc = Nokogiri::XML(open(url))
File.open("exporterad.txt", "w") do |file|
doc.xpath("//*").each do |item|
title = item.xpath('//result[group_name="Name"]')
phone = item.xpath("/Number").text.strip
puts "#{title} ; \n"
puts "#{phone} ; \n"
company = " #{title}; #{phone}; \n\n"
file.write(company.gsub(/^\s+/,''))
end
end
Besides the fact that your code isn't valid Ruby, you're making it a lot harder than necessary, at least for a simple scrape and save:
require 'nokogiri'
require 'open-uri'
url = "http://api.pts.se/PTSNumberService/Pts_Number_Service.svc/pox/SearchByNumber?number=8-6785503"
doc = Nokogiri::XML(open(url))
File.open("exported.txt", "w") do |file|
name = doc.at('Name').text
number = doc.at('Number').text
file.puts name
file.puts number
end
Running that results in a file called "exported.txt" that contains:
Tele2 Sverige AB
8-6785503
You can build upon that as necessary.

Web Scraping with Nokogiri::HTML and Ruby - save images

I'm working on a script to grab data & images from webshop productpages
(with approval from the owner)
I have a working script that loops through a CSV file with 20042 product URLS to get me the data I need that is stored in a CSV file. Final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread)
URL = 'http://www.sample.com/page.html'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
def make_absolute( href, root )
URI.parse(root).merge(URI.parse(href)).to_s
end
Nokogiri::HTML(open(URL)).xpath('//*[#id="zoom"]/#href').each do |src|
uri = make_absolute(src,URL)
File.open(File.basename(uri),'wb'){ |f| f.write(open(uri).read) }
end
that runs great for a seperate URL but I'm struggling to get it working and loop through the URLS from the CSV file in my main script that starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'
#prices = Array.new
#title = Array.new
#description = Array.new
#warranty = Array.new
#leadtime = Array.new
#urls = Array.new
#categories = Array.new
#subcategories = Array.new
#subsubcategories = Array.new
urls = CSV.read("lotofurls.csv")
(0..urls.length - 1).each do |index|
puts urls[index][0]
doc = Nokogiri::HTML(open(urls[index][0]))
Looks like all I need to figure out is how to feed the urls to the code saving the image but any help would be much appreciated!
You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc)
For RMagick, you could do something like this
require 'rmagick'
images.each do |image|
url = image.url # should be a string
Magick::Image.read(url).first.resize_to_fill(200,200).write(image.desired_filename)
end
That would write a 200x200px image for each url you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the railscast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/

Resources