Save Webscraped data - ruby

I am trying to scrape a website. The scraping itself works, but I am having trouble saving the scraped data to a YAML file.
My Code:
require 'rubygems'
require 'open-uri'
require 'hpricot'
article = []
doc = open("http://www.cmegroup.com/trading/interest-rates/cleared-otc/irs.html"{|f| Hpricot(f) }
(doc/"/html/body/div/div/div/div/table/").each do |article|
puts "#{article.inner_html}"
end
File.open('test.yaml', 'w') { |f|
f <<article.to_yaml
}

First, the open call is missing its closing parenthesis: there should be a ) right before the block starts.
Once you add that, you'll get a NoMethodError (undefined method 'to_yaml' for []:Array). To fix it, require 'yaml', which defines to_yaml on core objects such as Array. After that, your YAML file will contain only an empty array, because the block variable article shadows the outer array and nothing is ever appended to it. Here's a fixed version:
require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'yaml'
articles = []
url = "http://www.cmegroup.com/trading/interest-rates/cleared-otc/irs.html"
doc = open(url) {|f| Hpricot(f) }
(doc/"/html/body/div/div/div/div/table/").each do |article|
articles << article.inner_html
end
File.open('test.yaml', 'w') { |f| f << articles.to_yaml }
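To see the YAML step on its own, here is a minimal sketch that needs no network access. The two HTML strings are made-up stand-ins for the scraped table fragments, and the file goes to the system temp directory rather than next to the script; both are assumptions for illustration only.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical stand-ins for the inner_html of each scraped table.
articles = ['<tr><td>row 1</td></tr>', '<tr><td>row 2</td></tr>']

# Write the array as YAML, exactly as in the fixed version above.
path = File.join(Dir.tmpdir, 'test.yaml')
File.open(path, 'w') { |f| f << articles.to_yaml }

# Read it back to confirm the data survived the round trip.
restored = YAML.load_file(path)
puts restored.inspect
```

Reading the file back is a cheap way to check that what was written is the collected array and not an empty one.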

Related

Getting all unique URL's using nokogiri

I've been trying for a while to use the .uniq method to generate a unique list of URLs from a website (within the /informatics path). No matter what I try, I get a method error when generating the list. I'm sure it's a syntax issue, and I was hoping someone could point me in the right direction.
Once I have the list I'll need to store the URLs in a database via ActiveRecord, but I need the unique list before I start to wrap my head around that.
require 'nokogiri'
require 'open-uri'
require 'active_record'
ARGV[0]="https://www.nku.edu/academics/informatics.html"
ARGV.each do |arg|
open(arg) do |f|
# Display connection data
puts "#"*25 + "\nConnection: '#{arg}'\n" + "#"*25
[:base_uri, :meta, :status, :charset, :content_encoding,
:content_type, :last_modified].each do |method|
puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
end
# Display the href links
base_url = /^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]
puts "base_url: #{base_url}"
Nokogiri::HTML(f).css('a').each do |anchor|
href = anchor['href']
# Make Unique
if href =~ /.*informatics/
puts href
#store stuff to active record
end
end
end
end
Replace the Nokogiri::HTML part so that it selects only the anchors whose href attribute matches /.*informatics/. select returns an Array, so you can then map the anchors to their href values and call uniq:
require 'nokogiri'
require 'open-uri'
require 'active_record'
ARGV[0] = 'https://www.nku.edu/academics/informatics.html'
ARGV.each do |arg|
open(arg) do |f|
puts "#{'#' * 25} \nConnection: '#{arg}'\n #{'#' * 25}"
%i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
puts "#{method}: #{f.send(method)}" if f.respond_to? method
end
puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"
anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
puts anchors.map { |anchor| anchor['href'] }.uniq
end
end
See output.
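The filter-then-uniq idea can also be tried without fetching the page. This is a rough sketch using only the standard library: the href list is invented sample data, and resolving relative links against the base URL with URI.join is an extra step not in the answer above, added because the same page often appears under both a relative and an absolute link.

```ruby
require 'uri'

base_url = 'https://www.nku.edu'

# Hypothetical href values, as they might come from anchor['href'].
hrefs = [
  '/academics/informatics.html',
  '/academics/informatics/programs.html',
  '/academics/informatics.html',
  'https://www.nku.edu/academics/informatics/programs.html',
  '/admissions.html',
  nil # anchors without an href attribute return nil
]

unique_urls = hrefs
  .compact                                        # drop anchors with no href
  .grep(/informatics/)                            # keep only the informatics path
  .map { |href| URI.join(base_url, href).to_s }   # make every link absolute
  .uniq                                           # remove duplicates

puts unique_urls
```

Calling uniq after normalizing to absolute URLs keeps a relative and an absolute link to the same page from both ending up in the list.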

`open_http': 500 Internal Server Error (OpenURI::HTTPError)

I'm trying to fetch and parse a URL but keep getting this 500 error. Any suggestions?
require 'open-uri'
require 'json'
require 'csv'
url = 'https://gist.githubusercontent.com/gregclermont/ca9e8abdff5dee9ba9db/raw/
7b2318efcf8a7048f720bcaff2031d5467a4a2c8/users.json'
encoded_url = URI.encode(url)
open(encoded_url) do |stream|
quote = JSON.parse(stream.read)
puts quote
end
require 'open-uri'
require 'json'
url = 'https://gist.githubusercontent.com/gregclermont/ca9e8abdff5dee9ba9db/raw/7b2318efcf8a7048f720bcaff2031d5467a4a2c8/users.json'
# On Ruby 3.0 and later, Kernel#open no longer accepts URLs; use URI.open from open-uri
URI.open(url) { |f| JSON.parse(f.read) }
Works fine for me.
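One possible cause, assuming the URL literal in the failing script really spans two lines as shown in the question: the string then contains a line break, which encoding the URL turns into an escape sequence inside the path, so the request goes to a different resource. The sketch below only inspects the strings and makes no HTTP request. valid_uri? is a hypothetical helper name; note also that URI.encode was removed in Ruby 3.0.

```ruby
require 'uri'

# The URL as written in the question, with a line break inside the literal.
broken = "https://gist.githubusercontent.com/gregclermont/ca9e8abdff5dee9ba9db/raw/
7b2318efcf8a7048f720bcaff2031d5467a4a2c8/users.json"

# Stripping all whitespace gives the single-line URL used in the answer.
fixed = broken.gsub(/\s+/, '')

# Hypothetical helper: true when the string parses as a URI.
def valid_uri?(string)
  URI.parse(string)
  true
rescue URI::InvalidURIError
  false
end

puts "broken contains a newline: #{broken.include?("\n")}"
puts "broken parses: #{valid_uri?(broken)}"
puts "fixed parses:  #{valid_uri?(fixed)}"
```

If the URL in the real script is already on a single line, this is not the cause and the error more likely comes from the server side.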

Ruby Sinatra - How to capture post json data and save to file

How do I capture JSON data sent to a POST route and save it to a file? I have the simple Ruby Sinatra code below.
#!/usr/bin/env ruby
require 'rubygems'
require 'sinatra'
require 'json'
post '/' do
values = JSON.parse(request.env["rack.input"].read)
# How do I save "values" of JSON to file..
end
Try this
#!/usr/bin/env ruby
require 'rubygems'
require 'sinatra'
require 'json'
post '/' do
values = JSON.parse(request.env["rack.input"].read)
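# Serialize back to JSON text; writing the hash directly would store Ruby's inspect format
File.open('file.txt', 'w') { |file| file.write(JSON.pretty_generate(values)) }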
end
To write a file in Ruby you can use:
File.open('/your/path/file', 'w') { |file| file.write(values) }
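The file-writing part can be tried outside Sinatra. The following is a small sketch under stated assumptions: save_json is a hypothetical helper name, the body string stands in for what the route would read from the request, and the file is placed in the system temp directory.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: turn the parsed data back into JSON text and write it to path.
def save_json(values, path)
  File.write(path, JSON.pretty_generate(values))
end

# Stand-in for the raw request body the POST route would receive.
body = '{"name":"alice","tags":["a","b"]}'
values = JSON.parse(body)

path = File.join(Dir.tmpdir, 'payload.json')
save_json(values, path)

# Parse the saved file again to confirm it is valid JSON with the same content.
reloaded = JSON.parse(File.read(path))
puts reloaded.inspect
```

Serializing with the JSON library rather than writing the hash as-is means the file can be parsed again later by any JSON reader.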

Printing to file from Ruby pp

I'm parsing a JSON file in Ruby and want to output the results using pp to a file. How can I do that? Here's the code I'm trying:
require 'rubygems'
require 'json'
require 'pp'
json = File.read('players.json')
plyrs = JSON.parse(json)
File.open('plyrs.txt', 'a') { |fo| pp page, fo }
require "rubygems" is redundant in Ruby >= 1.9.
require "json"
require "pp"
# JSON.load("players.json") would try to parse the file name itself as JSON
plyrs = JSON.parse(File.read("players.json"))
File.open("plyrs.txt", "a"){|io| io.write(plyrs.pretty_inspect)}
Try this
require 'rubygems'
require 'json'
require 'pp'
json = File.read('players.json')
plyrs = JSON.parse(json)
# pp prints to stdout; PP.pp takes the output stream as its second argument
File.open('plyrs.txt', 'a') { |file| PP.pp(plyrs, file) }
More information is available in the Ruby documentation.
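For a version that runs without a players.json on disk, here is a short sketch. The JSON string is invented sample data, the output goes to the system temp directory, and the line width of 60 is an arbitrary choice made so the wrapping is visible; the file is opened with 'w' rather than 'a' so repeated runs give the same result.

```ruby
require 'json'
require 'pp'
require 'tmpdir'

# Hypothetical sample data in place of the contents of players.json.
json = '{"players":[{"name":"Ann","team":"Red","score":12},' \
       '{"name":"Bob","team":"Blue","score":7}]}'
plyrs = JSON.parse(json)

path = File.join(Dir.tmpdir, 'plyrs.txt')

# PP.pp(object, output, width) pretty-prints into any object that supports <<,
# including an open File.
File.open(path, 'w') { |file| PP.pp(plyrs, file, 60) }

written = File.read(path)
puts written
```

The same approach works with plyrs.pretty_inspect, which returns the pretty-printed text as a String instead of writing it to a stream.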

Creating a file for each URL of a site using Ruby

I have the following code, which creates a file with the content from a crawled site:
require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'
Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
anemone.on_pages_like(/http:\/\/www.findbrowsenodes.com\/us\/.+\/[\d]*/) do | page |
doc = Nokogiri::HTML(open(page.url))
node_id = doc.at_css("#n_info #clipnode").text unless doc.at_css("#n_info #clipnode").nil?
node_name = doc.at_css("#n_info .node_name").text unless doc.at_css("#n_info .node_name").nil?
node_url = page.url
open("filename.txt", "a") do |f|
f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
end
end
Now I want to create one file per page, each named after its node_id, instead of a single file. I tried this:
page.each do |p|
p.open("#{node_id}.txt", "a") do |f|
f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
end
but got this:
undefined method `value' for #<Nokogiri::XML::DTD:0x51c089a name="html"> (NoMethodError)
then tried this:
page.open("#{node_id}.txt", "a") do |f|
f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
but got this:
private method `open' called for #<Anemone::Page:0x91472e8> (NoMethodError)
What's the right way of doing this?
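File.open("#{node_id}.txt", "w") do |f|
f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
Call File.open (or plain open) directly instead of sending open to page: open is a private Kernel method, which is why page.open raises a NoMethodError. How you assign node_id is up to you.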
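As an illustration of one file per node without running the crawler, the sketch below loops over a few made-up records. The ids, names, and URLs are hypothetical, the files go to a fresh temp directory, and skipping records without an id is an assumption based on node_id being nil when the selector finds nothing.

```ruby
require 'tmpdir'

# Hypothetical records standing in for the values scraped from each page.
nodes = [
  { id: '1000', name: 'Books', url: 'http://www.findbrowsenodes.com/us/Books/1000' },
  { id: '2000', name: 'Music', url: 'http://www.findbrowsenodes.com/us/Music/2000' },
  { id: nil, name: 'Unknown', url: 'http://www.findbrowsenodes.com/us/Unknown/' }
]

# Without a block, Dir.mktmpdir returns the path of a new temp directory.
out_dir = Dir.mktmpdir('nodes')

nodes.each do |node|
  # A nil id would produce a file literally named ".txt", so skip those.
  next if node[:id].nil?

  File.open(File.join(out_dir, "#{node[:id]}.txt"), 'a') do |f|
    f.puts "#{node[:id]}\t#{node[:name]}\t#{node[:url]}"
  end
end

created = Dir.children(out_dir).sort
puts created
```

Inside the Anemone block the same File.open call would sit where the single filename.txt is opened now, using the node_id extracted for that page.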
