Nokogiri: Slop access a node named name

I'm trying to parse an XML document that looks like this:
<lesson>
<name>toto</name>
<version>42</version>
</lesson>
Using Nokogiri::Slop.
I can access version easily through lesson.version, but I cannot access lesson.name, as name in this case refers to the name of the node (lesson).
Is there any way to access the child?

As a variant you could try this one:
doc.lesson.elements.select{|el| el.name == "name"}
Why? Because of these benchmarks:
require 'nokogiri'
require 'benchmark'
str = '<lesson>
<name>toto</name>
<version>42</version>
</lesson>'
doc = Nokogiri::Slop(str)
n = 50000
Benchmark.bm do |x|
x.report("select") { n.times do; doc.lesson.elements.select{|el| el.name == "name"}; end }
x.report("search") { n.times do; doc.lesson.search('name'); end }
end
Which gives us the result:
#=>              user     system      total        real
#=> select   1.466000   0.047000   1.513000 (  1.528153)
#=> search   2.637000   0.125000   2.762000 (  2.777278)
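If you only want the single node, Enumerable#find works the same way and lets you call .text directly (based on the sample document above):
doc.lesson.elements.find { |el| el.name == "name" }.text  # => "toto"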

You can use search and give it an XPath or CSS selector:
doc.lesson.search('name').first
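To get the value itself, chain .text onto the result:
doc.lesson.search('name').first.text  # => "toto"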

Do a bit of a hack using metaprogramming.
require 'nokogiri'
doc = Nokogiri::Slop <<-HTML
<lesson>
<name>toto</name>
<version>42</version>
</lesson>
HTML
name_val = doc.lesson.instance_eval do
self.class.send :undef_method, :name
self.name
end.text
p name_val # => toto
p doc.lesson.version.text # => '42'
Nokogiri::XML::Node#name is a method defined to return the name of a Nokogiri::XML::Node. For a moment, remove that method from the class Nokogiri::XML::Node inside the #instance_eval block, so the call falls through to Slop's method_missing and resolves the child element instead.

Related

Ruby String to formatted JSON

I've scraped part of the data from the page with Nokogiri.
require 'net/http'
require 'nokogiri'
require 'open-uri'
require 'json'
sources = {
cb: "http://www.cbbankmm.com/fxratesho.php",
}
puts "Currencies from CB Bank are"
if @page = Nokogiri::HTML(open(sources[:cb]))
(1..3).each do |i|
puts @page.css("tr")[i].text.gsub(/\s+/,'')
end
end
The result is
Currencies from CB Bank are
USD873883
SGD706715
EURO11241135
I would like to format the output to the below JSON format
{
"bank":"CB",
"rates": {
"USD":"[873,883]",
"SGD":"[706,715]",
"EURO":"[1124,1135]"
}
}
Which gems or methods do I have to use to get the above Hash or JSON format?
Some abstraction might be an idea. So, perhaps a class to help you with the job:
class Currencies
def initialize(page, bank)
@page = page
@bank = bank
end
def parsed
@parsed ||= @page.css("tr").collect{ |el| el.text.gsub(/\s+/,'') }
end
def to_hash
{
bank: @bank,
rates: {
USD: usd,
SGD: sgd,
....
}
}
end
def usd
parsed[0].gsub(/^USD/, '')
end
def sgd
parsed[1].gsub(/^SGD/, '')
end
...
end
Use it like this:
Currencies.new(Nokogiri::HTML(open(sources[:cb])), "CB").to_hash.to_json
Just make an equivalent hash structure in Ruby, and do e.g.
hash = {
"bank" => "CB",
"rates" => {
"USD" => "[873,883]",
"SGD" => "[706,715]",
"EURO" => "[1124,1135]"
}
}
hash.to_json
You are already requiring the json gem. You would build the Ruby hash up in the places where you currently have puts statements.
Edit: If the layout is important to you, you may prefer:
JSON.pretty_generate( hash )
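If you want to build that hash from the scraped rows rather than hard-coding it, here is a rough sketch. It assumes @page from the question, and that each row's text is a currency code followed by two rates with the same number of digits, as in the output shown above:
require 'json'

# e.g. ["USD873883", "SGD706715", "EURO11241135"]
rows = (1..3).map { |i| @page.css("tr")[i].text.gsub(/\s+/, '') }

rates = rows.each_with_object({}) do |row, result|
  code, digits = row.match(/\A([A-Z]+)(\d+)\z/).captures
  half = digits.length / 2
  # assumption: the buy and sell rates always have an equal digit count
  result[code] = "[#{digits[0, half]},#{digits[half..-1]}]"
end

puts JSON.pretty_generate({ "bank" => "CB", "rates" => rates })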

Data scraping with Nokogiri

I am able to scrape http://www.example.com/view-books/0/new-releases using Nokogiri, but how do I scrape all the pages? This one has five pages, but without knowing the last page, how do I proceed?
This is the program that I wrote:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
urls=Array['http://www.example.com/view-books/0/new-releases?layout=grid&_pop=flyout',
'http://www.example.com/view-books/1/bestsellers',
'http://www.example.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253'
]
@titles=Array.new
@prices=Array.new
@descriptions=Array.new
@page=Array.new
urls.each do |url|
doc=Nokogiri::HTML(open(url))
puts doc.at_css("title").text
doc.css('.fk-inf-scroll-item').each do |item|
@prices << item.at_css(".final-price").text
@titles << item.at_css(".fk-srch-title-text").text
@descriptions << item.at_css(".fk-item-specs-section").text
@page << item.at_css(".fk-inf-pageno").text rescue nil
end
(0..@prices.length - 1).each do |index|
puts "title: #{@titles[index]}"
puts "price: #{@prices[index]}"
puts "description: #{@descriptions[index]}"
# puts "pageno. : #{@page[index]}"
puts ""
end
end
CSV.open("result.csv", "wb") do |row|
row << ["title", "price", "description","pageno"]
(0..@prices.length - 1).each do |index|
row << [@titles[index], @prices[index], @descriptions[index], @page[index]]
end
end
As you can see, I have hardcoded the URLs. How do you suggest I scrape the entire books category? I was trying Anemone but couldn't get it to work.
If you inspect what happens when you load more results, you will realise that the page is actually requesting JSON with an offset.
So, you can get the five pages like this :
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=0
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=20
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=40
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=60
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=80
Basically, you keep incrementing inf-start and fetching results until you get a result set of fewer than 20 items, which should be your last page.
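A rough, untested sketch of that loop (the 'results' key is only a guess; inspect the real JSON payload to find the right one):
require 'open-uri'
require 'json'

base_url  = 'http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=%d'
page_size = 20
offset    = 0
items     = []

loop do
  json  = JSON.parse(open(format(base_url, offset)).read)
  batch = json['results'] || []    # assumption: adjust this key to match the actual response
  items.concat(batch)
  break if batch.size < page_size  # fewer than 20 items means this was the last page
  offset += page_size
end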
Here's an untested sample of code that does what yours does, only written a bit more concisely:
require 'nokogiri'
require 'open-uri'
require 'csv'
urls = %w[
http://www.flipkart.com/view-books/0/new-releases?layout=grid&_pop=flyout
http://www.flipkart.com/view-books/1/bestsellers
http://www.flipkart.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253
]
CSV.open('result.csv', 'wb') do |row|
row << ['title', 'price', 'description', 'pageno']
urls.each do |url|
doc = Nokogiri::HTML(open(url))
puts doc.at_css('title').text
doc.css('.fk-inf-scroll-item').each do |item|
page = {
titles: item.at_css('.fk-srch-title-text').text,
prices: item.at_css('.final-price').text,
descriptions: item.at_css('.fk-item-specs-section').text,
pageno: (item.at_css('.fk-inf-pageno').text rescue nil),
}
page.each do |k, v|
puts '%s: %s' % [k.to_s, v]
end
row << page.values
end
end
end
There are some useful pieces of data you can use to help you figure out how many records you need to retrieve:
var config = {container: "#search_results", page_size: 20, counterSelector: ".fk-item-count", totalResults: 88, "startParamName" : "inf-start", "startFrom": 20};
To access the values use something like:
doc.at('script[type="text/javascript+fk-onload"]').text =~ /page_size: (\d+).+totalResults: (\d+).+"startFrom": (\d+)/
page_size, total_results, start_from = $1, $2, $3
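From those captures you can work out how many requests are needed, for example:
page_size     = $1.to_i                                        # 20
total_results = $2.to_i                                        # 88
last_start    = ((total_results - 1) / page_size) * page_size  # 80
offsets       = (0..last_start).step(page_size).to_a           # [0, 20, 40, 60, 80]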

How would I parse this XML with Ruby?

Currently, I have an XML document (called Food_Display_Table.xml) with data in a format like this:
<Food_Display_Table>
<Food_Display_Row>
<Food_Code>12350000</Food_Code>
<Display_Name>Sour cream dip</Display_Name>
....
<Solid_Fats>105.64850</Solid_Fats>
<Added_Sugars>1.57001</Added_Sugars>
<Alcohol>.00000</Alcohol>
<Calories>133.65000</Calories>
<Saturated_Fats>7.36898</Saturated_Fats>
</Food_Display_Row>
...
</Food_Display_Table>
I would like to print some of this information in human readable format. Like this:
-----
Sour cream dip
Calories: 133.65000
Saturated Fats: 7.36898
-----
So far, I have tried this, but it doesn't work:
require 'rexml/document'
include REXML
data = Document.new File.new("Food_Display_Table.xml", "r")
data.elements.each("*/*/*") do |foodcode, displayname, portiondefault, portionamount, portiondisplayname, factor, increments, multiplier, grains, wholegrains, orangevegetables, darkgreenvegetables, starchyvegetables, othervegetables, fruits, milk, meats, soy, drybeans, oils, solidfats, addedsugars, alcohol, calories, saturatedfats|
puts "----"
puts displayname
puts "Calories: {calories}"
puts "Saturated Fats: {saturatedfats}"
puts "----"
end
Use XPath. (Your version doesn't work because REXML's elements.each yields a single element per iteration, not one block argument per child node.) I tend to go with Nokogiri, as I prefer the API.
With the paths hard-coded:
require 'nokogiri'
doc = Nokogiri::XML(xml_string)
doc.xpath(".//Food_Display_Row").each do |node|
puts "-"*5
puts "Name: #{node.xpath('.//Display_Name').text}"
puts "Calories: #{node.xpath('.//Calories').text}"
puts "Saturated Fats: #{node.xpath('.//Saturated_Fats').text}"
puts "-"*5
end
Or, for something a bit DRYer:
nodes_to_display = ["Display_Name", "Calories", "Saturated_Fats"]
doc = Nokogiri::XML(xml_string)
doc.xpath(".//Food_Display_Row").each do |node|
nodes_to_display.each do |node_name|
if value = node.at_xpath(".//#{node_name}")
puts "#{node_name}: #{value.text}"
end
end
end
I'd do it like this, with Nokogiri:
require 'nokogiri' # gem install nokogiri
doc = Nokogiri::XML(IO.read('Food_Display_Table.xml'))
good_fields = %w[ Calories Saturated_Fats ]
puts "-"*5
doc.search("Food_Display_Row").each do |node|
puts node.at('Display_Name').text
node.search(*good_fields).each do |node|
puts "#{node.name.gsub('_',' ')}: #{node.text}"
end
puts "-"*5
end
If I had to use REXML (which I used to love, but now love Nokogiri more), the following works:
require 'rexml/document'
doc = REXML::Document.new( IO.read('Food_Display_Table.xml') )
separator = "-"*15
puts separator
desired = %w[ Calories Saturated_Fats ]
doc.root.elements.each do |row|
puts REXML::XPath.first( row, 'Display_Name' ).text
desired.each do |node_name|
REXML::XPath.each( row, node_name ) do |node|
puts "#{node_name.gsub('_',' ')}: #{node.text}"
end
end
puts separator
end
#=> ---------------
#=> Sour cream dip
#=> Calories: 133.65000
#=> Saturated Fats: 7.36898
#=> ---------------

xml to querystring

What is the best option to generate a query string (URL params) from an XML document in Ruby?
xml_string = <abc><session>1234</session><description>some_description</description></abc>
query_string = # I want here "?abc=session......."
xml_string = "<abc><session>1234</session><description>some_description</description></abc>"
result = "?"+Hash.from_xml(xml_string).to_query
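Note that Hash.from_xml and Hash#to_query come from ActiveSupport, so outside Rails you need to require it yourself. The result uses nested-parameter bracket notation (percent-encoded), roughly:
require 'active_support/core_ext'  # assumption: the activesupport gem is installed

xml_string = "<abc><session>1234</session><description>some_description</description></abc>"
"?" + Hash.from_xml(xml_string).to_query
# => "?abc%5Bdescription%5D=some_description&abc%5Bsession%5D=1234" (approximately)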
XmlSimple is a nice gem.
require 'rubygems'
require 'xmlsimple'
require 'cgi'
def xml_to_url_params(xml_data, root)
elements = []
xml_data[root].each do |item|
item.each do |name, value|
elements << "#{CGI::escape(name)}=#{CGI::escape(value)}"
end
end
elements.join("&")
end
data = XmlSimple.xml_in(xml_string)
url_params = xml_to_url_params(data, "abc")
P.S. I haven't tested this code, so there may be bugs ;)

One-liner to Convert Nested Hashes into dot-separated Strings in Ruby?

What's the simplest method to convert YAML to dot-separated strings in Ruby?
So this:
root:
  child_a: Hello
  child_b:
    nested_child_a: Nesting
    nested_child_b: Nesting Again
  child_c: K
To this:
{
"ROOT.CHILD_A" => "Hello",
"ROOT.CHILD_B.NESTED_CHILD_A" => "Nesting",
"ROOT.CHILD_B.NESTED_CHILD_B" => "Nesting Again",
"ROOT.CHILD_C" => "K"
}
It's not a one-liner, but perhaps it will fit your needs
def to_dotted_hash(source, target = {}, namespace = nil)
prefix = "#{namespace}." if namespace
case source
when Hash
source.each do |key, value|
to_dotted_hash(value, target, "#{prefix}#{key}")
end
when Array
source.each_with_index do |value, index|
to_dotted_hash(value, target, "#{prefix}#{index}")
end
else
target[namespace] = source
end
target
end
require 'pp'
require 'yaml'
data = YAML.load(DATA)
pp data
pp to_dotted_hash(data)
__END__
root:
  child_a: Hello
  child_b:
    nested_child_a: Nesting
    nested_child_b: Nesting Again
  child_c: K
prints
{"root"=>
{"child_a"=>"Hello",
"child_b"=>{"nested_child_a"=>"Nesting", "nested_child_b"=>"Nesting Again"},
"child_c"=>"K"}}
{"root.child_c"=>"K",
"root.child_b.nested_child_a"=>"Nesting",
"root.child_b.nested_child_b"=>"Nesting Again",
"root.child_a"=>"Hello"}
