How do I save the information from an XML page that I got from an API?
The URL is "http://api.url.com?number=8-6785503" and it returns:
<OperatorDataContract xmlns="http://psgi.pts.se/PTS_Number_Service" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Name>Tele2 Sverige AB</Name>
<Number>8-6785503</Number>
</OperatorDataContract>
How do I parse the Name and Number nodes to a file?
Here is my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://api.url.com?number=8-6785503"
doc = Nokogiri::XML(open(url))
File.open("exporterad.txt", "w") do |file|
doc.xpath("//*").each do |item|
title = item.xpath('//result[group_name="Name"]')
phone = item.xpath("/Number").text.strip
puts "#{title} ; \n"
puts "#{phone} ; \n"
company = " #{title}; #{phone}; \n\n"
file.write(company.gsub(/^\s+/,''))
end
end
Besides the fact that your code isn't valid Ruby, you're making it a lot harder than necessary, at least for a simple scrape and save:
require 'nokogiri'
require 'open-uri'
url = "http://api.pts.se/PTSNumberService/Pts_Number_Service.svc/pox/SearchByNumber?number=8-6785503"
doc = Nokogiri::XML(open(url))
File.open("exported.txt", "w") do |file|
name = doc.at('Name').text
number = doc.at('Number').text
file.puts name
file.puts number
end
Running that results in a file called "exported.txt" that contains:
Tele2 Sverige AB
8-6785503
You can build upon that as necessary.
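For instance, if you wanted to look up several numbers and collect each result on one line, a minimal sketch (the second number below is purely hypothetical, and it assumes the API accepts any number via the same query string) could look like this:
require 'nokogiri'
require 'open-uri'

# Only the first number comes from the question; the second is a placeholder.
numbers = ['8-6785503', '8-1234567']

File.open("exported.txt", "w") do |file|
  numbers.each do |number|
    doc = Nokogiri::XML(open("http://api.url.com?number=#{number}"))
    # at('Name') / at('Number') return the first matching node, as above
    file.puts "#{doc.at('Name').text}; #{doc.at('Number').text}"
  end
end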
Related
When I try to run this via the terminal I can parse/display the data, but when I type in pets_array = [] I am not seeing anything.
My code is as follows:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://newyork.craigslist.org/search/pet?s=0')
parse_page = Nokogiri::HTML(page)
pets_array = []
parse_page.css('.content').css('.row').css('.result-title hdrlnk').map do |a|
  post_name = a.text
  pets_array.push(post_name)
end
CSV.open('pets.csv', 'w') do |csv|
  csv << pets_array
end
Pry.start(binding)
To be precise, you could access each anchor tag with the classes .result-title.hdrlnk inside .result-info, .result-row, .rows, and .content:
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
p pets_array
# ["Mini pig", "Black Russian Terrier", "2 foster or forever homes needed Asap!", ...]
As you're using map, its return value already holds the text of each iterated element, so you can assign it straight to the pets_array variable; there's no need to push.
If you want to write the data stored in the array, you can pass it directly to the CSV; there's no need to first define pets_array as an empty array (which is why you get a blank CSV file):
require 'httparty'
require 'nokogiri'
require 'csv'
page = HTTParty.get 'https://newyork.craigslist.org/search/pet?s=0'
parse_page = Nokogiri::HTML page
pets_array = parse_page.css('.content .rows .result-row .result-info .result-title.hdrlnk').map &:text
CSV.open('pets.csv', 'w') { |csv| csv << pets_array }
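As a side note, csv << pets_array writes the whole array as one CSV row. If you'd rather have one pet per row, a small variation (just a sketch) is:
CSV.open('pets.csv', 'w') do |csv|
  # wrap each title in its own single-element row
  pets_array.each { |pet| csv << [pet] }
end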
I have this code here which outputs a CSV, but when I open the CSV file it just has a 0 in the first two columns.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
page = Nokogiri::HTML(open("https://www.drugs.com/pharmaceutical-
companies.html"))
puts page.class #=> Nokogiri::HTML::Document
pharma_links = page.css("div.col-list-az a")
link= pharma_links.each{|link| puts link['href'] }
company = pharma_links.each{|link| puts link.text}
CSV.open("/Users/file.csv", "wb") do |csv|
csv << [company, link]
end
The problem is that pharma_links.each{|link| ...} returns the ENTIRE collection it was called on, not the values you printed, so if you do this once for company and once for link you end up with two references to the same node set. You would then have to re-map each company and link into a new array or hash (or pair them up by index if you are lazy AND you know for certain nothing went wrong in either .each call).
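A quick illustration of the difference (a plain-Ruby sketch, unrelated to the scrape itself):
links = ['a', 'b', 'c']
printed = links.each { |l| puts l }    # printed is the original array: ["a", "b", "c"]
upcased = links.map  { |l| l.upcase }  # upcased is a new array: ["A", "B", "C"]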
To avoid this, simply construct the CSV while you are looping through the data. For each line of the CSV you expect one pharma_links 'line', so iterate through each at the same time:
require 'nokogiri'
require 'open-uri'
require 'csv'
page = Nokogiri::HTML(open("https://www.drugs.com/pharmaceutical-companies.html"))
# puts page.class #=> Nokogiri::HTML::Document
pharma_links = page.css("div.col-list-az a")
# Create the CSV and iterate through the links while creating it
# You can also add headers to the CSV on instantiation
CSV.open("file.csv", "wb", write_headers: true, headers: ['url','description']) do |csv|
pharma_links.each do |link|
puts "Adding #{link.text}" # prove that it works :)
csv << [link['href'], link.text]
end
end
I am a Ruby newbie and I tried my first scraper today. It's a scraper designed to store recipes in a CSV file. Nevertheless, I can't figure out why it doesn't work. Here is my code:
recipe.rb :
require 'csv'
require 'nokogiri'
require 'open-uri'
def write_csv(ingredient)
  doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
  doc.search(".m_contenu_resultat").first(10).each do |item|
    name = item.search('.m_titre_resultat a').text
    description = item.search('.m_texte_resultat').text
    cooking_time = item.search('.m_detail_time').text
    diff = item.search('.m_detail_recette').text.split('-')
    difficulty = diff[2]
    recipes = [name, description, cooking_time, difficulty]
    CSV.open('recueil.csv', 'wb') do |csv|
      csv << recipes
    end
  end
end
write_csv('chocolat')
Thank you so much for your answers, it'll help me a lot!
IT WORKED! I changed my code as below, using a hash:
require 'csv'
require 'nokogiri'
require 'open-uri'
def write_csv(ingredient)
  recipes = []
  doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
  doc.search(".m_contenu_resultat").first(10).each do |item|
    name = item.search('.m_titre_resultat a').text
    description = item.search('.m_texte_resultat').text
    cooking_time = item.search('.m_detail_time').text
    diff = item.search('.m_detail_recette').text.split('-')
    difficulty = diff[2]
    recipes << {
      name: name,
      description: description,
      difficulty: difficulty
    }
  end
  CSV.open('recueil.csv','a') do |csv|
    csv << ["name", "description", "cooking_time", "difficulty"]
    recipes.each do |recipe|
      csv << [
        recipe[:name],
        recipe[:description],
        recipe[:cooking_time],
        recipe[:difficulty]
      ]
    end
  end
end
write_csv('chocolat')
When you are opening your CSV file you are overwriting the previous one every time. You should either append to the file like this:
CSV.open('recueil.csv', 'a') do |csv|
or you could open it before you start looping like this:
def write_csv(ingredient)
  doc = Nokogiri::HTML(open("http://www.marmiton.org/recettes/recherche.aspx?aqt=#{ingredient}"), nil, 'utf-8')
  csv = CSV.open('recueil.csv', 'wb')
  doc.search(".m_contenu_resultat").first(10).each do |item|
    name = item.search('.m_titre_resultat a').text
    description = item.search('.m_texte_resultat').text
    cooking_time = item.search('.m_detail_time').text
    diff = item.search('.m_detail_recette').text.split('-')
    difficulty = diff[2]
    recipes = [name, description, cooking_time, difficulty]
    csv << recipes
  end
  csv.close
end
You don't specify what doesn't work or what the errors are, so I must speculate.
I tried your script and had difficulties with the encoding: since the site is in French, there are lots of special characters.
Try again with this at the head of your script, it should solve at least that problem.
# encoding: utf-8
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
I'm working on a script to grab data & images from webshop product pages
(with approval from the owner).
I have a working script that loops through a CSV file with 20042 product URLs and stores the data I need in a CSV file. The final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread)
URL = 'http://www.sample.com/page.html'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
def make_absolute( href, root )
  URI.parse(root).merge(URI.parse(href)).to_s
end
Nokogiri::HTML(open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
  uri = make_absolute(src,URL)
  File.open(File.basename(uri),'wb'){ |f| f.write(open(uri).read) }
end
That runs great for a single URL, but I'm struggling to get it working in a loop over the URLs from the CSV file in my main script, which starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'
@prices = Array.new
@title = Array.new
@description = Array.new
@warranty = Array.new
@leadtime = Array.new
@urls = Array.new
@categories = Array.new
@subcategories = Array.new
@subsubcategories = Array.new
urls = CSV.read("lotofurls.csv")
(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
Looks like all I need to figure out is how to feed the URLs into the code that saves the images, but any help would be much appreciated!
You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc.).
For RMagick, you could do something like this:
require 'rmagick'
images.each do |image|
  url = image.url # should be a string
  Magick::Image.read(url).first.resize_to_fill(200,200).write(image.desired_filename)
end
That would write a 200x200px image for each url you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the railscast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/
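If the goal is simply to wire the earlier download snippet into the loop over lotofurls.csv, a minimal sketch (assuming each row of the CSV holds one product-page URL, and reusing the make_absolute helper and zoom XPath from the question) might look like:
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'csv'

def make_absolute(href, root)
  URI.parse(root).merge(URI.parse(href)).to_s
end

CSV.read("lotofurls.csv").each do |row|
  url = row[0]
  Nokogiri::HTML(open(url)).xpath('//*[@id="zoom"]/@href').each do |src|
    uri = make_absolute(src.to_s, url)
    # Save each image next to the script, named after the file on the server
    File.open(File.basename(uri), 'wb') { |f| f.write(open(uri).read) }
  end
end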
EDIT: My original question was way off, my apologies. Mark Reed has helped me find out the real problem, so here it is.
Note that this code works:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
source_url = "www.flickr.com"
puts "Visiting #{source_url}"
page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
textarea = page.css('textarea')
filename = source_url.to_s + ".txt"
create_file = File.open("#{filename}", 'w')
create_file.puts textarea
create_file.close
Which is really awesome, but I need it to do this to ~110 URLs, not just Flickr. Here's my loop that isn't working:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
File.open('sources.txt').each_line do |source_url|
  puts "Visiting #{source_url}"
  page = Nokogiri::HTML(open("http://website/script.php?value=#{source_url}"))
  textarea = page.css('textarea')
  filename = source_url.to_s + ".txt"
  create_file = File.open("#{filename}", 'w')
  create_file.puts "#{textarea}"
  create_file.close
end
What am I doing wrong with my loop?
Ok, now you're looping over the lines of the input file. When you do that, you get strings that end in a newline. So you're trying to create a file with a newline in the middle of its name, which is not legal on Windows.
Just chomp the string:
File.open('sources.txt').each_line do |source_url|
  source_url.chomp!
  # ... rest of code goes here ...
You can also use File.foreach instead of File.open(...).each_line:
File.foreach('sources.txt') do |source_url|
  source_url.chomp!
  # ... rest of code goes here
You're putting your parentheses in the wrong place:
create_file = File.open(variable, 'w')