How would I parse this XML with Ruby? - ruby

Currently, I have an XML document (called Food_Display_Table.xml) with data in a format like this:
<Food_Display_Table>
<Food_Display_Row>
<Food_Code>12350000</Food_Code>
<Display_Name>Sour cream dip</Display_Name>
....
<Solid_Fats>105.64850</Solid_Fats>
<Added_Sugars>1.57001</Added_Sugars>
<Alcohol>.00000</Alcohol>
<Calories>133.65000</Calories>
<Saturated_Fats>7.36898</Saturated_Fats>
</Food_Display_Row>
...
</Food_Display_Table>
I would like to print some of this information in human readable format. Like this:
-----
Sour cream dip
Calories: 133.65000
Saturated Fats: 7.36898
-----
So far, I have tried this, but it doesn't work:
require 'rexml/document'
include REXML
data = Document.new File.new("Food_Display_Table.xml", "r")
data.elements.each("*/*/*") do |foodcode, displayname, portiondefault, portionamount, portiondisplayname, factor, increments, multiplier, grains, wholegrains, orangevegetables, darkgreenvegetables, starchyvegetables, othervegetables, fruits, milk, meats, soy, drybeans, oils, solidfats, addedsugars, alcohol, calories, saturatedfats|
puts "----"
puts displayname
puts "Calories: {calories}"
puts "Saturated Fats: {saturatedfats}"
puts "----"
end

Use XPath. I tend to go with Nokogiri, as I prefer its API.
With the paths hard-coded:
require 'nokogiri' # gem install nokogiri
xml_string = File.read('Food_Display_Table.xml')
doc = Nokogiri::XML(xml_string)
doc.xpath(".//Food_Display_Row").each do |node|
puts "-"*5
puts "Name: #{node.xpath('.//Display_Name').text}"
puts "Calories: #{node.xpath('.//Calories').text}"
puts "Saturated Fats: #{node.xpath('.//Saturated_Fats').text}"
puts "-"*5
end
Or, for something a bit DRYer:
nodes_to_display = ["Display_Name", "Calories", "Saturated_Fats"]
doc = Nokogiri::XML(xml_string)
doc.xpath(".//Food_Display_Row").each do |node|
nodes_to_display.each do |node_name|
if value = node.at_xpath(".//#{node_name}")
puts "#{node_name}: #{value.text}"
end
end
end

I'd do it like this, with Nokogiri:
require 'nokogiri' # gem install nokogiri
doc = Nokogiri::XML(IO.read('Food_Display_Table.xml'))
good_fields = %w[ Calories Saturated_Fats ]
puts "-"*5
doc.search("Food_Display_Row").each do |node|
puts node.at('Display_Name').text
node.search(*good_fields).each do |node|
puts "#{node.name.gsub('_',' ')}: #{node.text}"
end
puts "-"*5
end
If I had to use REXML (which I used to love, but now love Nokogiri more), the following works:
require 'rexml/document'
doc = REXML::Document.new( IO.read('Food_Display_Table.xml') )
separator = "-"*15
puts separator
desired = %w[ Calories Saturated_Fats ]
doc.root.elements.each do |row|
puts REXML::XPath.first( row, 'Display_Name' ).text
desired.each do |node_name|
REXML::XPath.each( row, node_name ) do |node|
puts "#{node_name.gsub('_',' ')}: #{node.text}"
end
end
puts separator
end
#=> ---------------
#=> Sour cream dip
#=> Calories: 133.65000
#=> Saturated Fats: 7.36898
#=> ---------------

Related

Ruby - Capitalize a title using map and capitalize methods

I'm working through The Odin Projects Ruby basics and completely stuck on 05_book_titles.
The title needs to be capitalized, including the 1st word but not including "small words" (ie "to", "the", etc) UNLESS it's the 1st word.
I can't get the code to do anything besides capitalize everything. Am I misusing map method? How can I get it to include the no_cap words in the returned title without capitalizing?
The Ruby file:
class Book
def title
@title
end
def title=(title)
no_cap = ["if", "or", "in", "a", "and", "the", "of", "to"]
p new_title = @title.split(" ")
p new_new_title = new_title.map{|i| i.capitalize if !no_cap.include? i}
.join(" ")
end
end
Some of the Spec file:
require 'book'
describe Book do
before do
@book = Book.new
end
describe 'title' do
it 'should capitalize the first letter' do
@book.title = "inferno"
expect(@book.title).to eq("Inferno")
end
it 'should capitalize every word' do
@book.title = "stuart little"
expect(@book.title).to eq("Stuart Little")
end
describe 'should capitalize every word except...' do
describe 'articles' do
specify 'the' do
@book.title = "alexander the great"
expect(@book.title).to eq("Alexander the Great")
end
specify 'a' do
@book.title = "to kill a mockingbird"
expect(@book.title).to eq("To Kill a Mockingbird")
end
specify 'an' do
@book.title = "to eat an apple a day"
expect(@book.title).to eq("To Eat an Apple a Day")
end
end
specify 'conjunctions' do
@book.title = "war and peace"
expect(@book.title).to eq("War and Peace")
end
end
end
end
Result:
Book
title
should capitalize the first letter (FAILED - 1)
Failures:
1) Book title should capitalize the first letter
Failure/Error: @book.title = "inferno"
NoMethodError:
undefined method `split' for nil:NilClass
# ./05_book_titles/book.rb:8:in `title='
# ./05_book_titles/book_titles_spec.rb:25:in `block (3 levels) in <top (required)>'
Finished in 0.0015 seconds (files took 0.28653 seconds to load)
1 example, 1 failure
Failed examples:
rspec ./05_book_titles/book_titles_spec.rb:24 # Book title should capitalize the first letter
You are using @title before it's assigned in
new_title = @title.split(" ")
It should be changed to title.
You also don't assign the calculated title to @title at the end of the title= method.
You also need to add 'an' to no_cap in order to pass the spec using "to eat an apple a day" as title.
And take care of the first word:
class Book
def title
@title
end
def title=(title)
no_cap = ["if", "or", "in", "a", "and", 'an', "the", "of", "to"]
new_title = title.split(' ').each_with_index.map do |x, i|
# Leave small words alone unless they are the first word.
if i != 0 && no_cap.include?(x)
x
else
x.capitalize
end
end
@title = new_title.join(' ')
end
end
small_words = ["if", "or", "in", "a", "and", "the", "of", "to"]
str = "tO be Or Not to be."
str.gsub(/\p{Alpha}+/) { |s| Regexp.last_match.begin(0) > 0 &&
small_words.include?(s.downcase) ? s.downcase : s.capitalize }
#=> "To Be or Not to Be."
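The gsub approach above can be wrapped in a method and checked against the titles from the spec. This is only a sketch: the method name titleize is made up for illustration, and 'an' has been added to the small-word list because the spec needs it.

```ruby
# Hypothetical helper built on the gsub approach; not part of the exercise files.
def titleize(str)
  small_words = %w[if or in a an and the of to]
  str.gsub(/\p{Alpha}+/) do |word|
    # A match that starts at offset 0 is the first word of the title.
    first_word = Regexp.last_match.begin(0).zero?
    if !first_word && small_words.include?(word.downcase)
      word.downcase
    else
      word.capitalize
    end
  end
end

puts titleize("to eat an apple a day")
puts titleize("alexander the great")
```

Note that this treats a word as "first" only when it starts at the very beginning of the string, so a title with leading whitespace would need a strip first.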

Find the name and age of the oldest person in a txt file using ruby

"Attached is a file with people's names and ages.
There will always be a First name and Last name followed by a colon then the age.
So each line with look something like this.
FirstName LastName: Age
Your job is to write a Ruby program that can read this file and figure out who the oldest person/people are on this list. Your program should print out their name(s) and age(s)."
This is the code I have so far:
File.open('nameage.txt') do |f|
f.each_line do |line|
line.split(":").last.to_i
puts line.split(":").last.to_i
end
end
With this, I am able to separate the name from the age but I don't know how to get the highest value and print out the highest value with name and age.
Please help!
"figure out who the oldest person/people are on this list", so multiple results are possible. Ruby has a group_by method, which groups an enumerable by a common property. What property? The property you specify in the block.
grouped = File.open('nameage.txt') do |f|
f.group_by do |line|
line.split(":").last.to_i # using OP's line
end
end
p grouped # just to see what it looks like
puts grouped.max.last # end result
You could push all the ages into an array. Do array.max or sort the array and do array[-1].
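The array idea can be sketched as follows. The sample lines are made up for illustration and stand in for the contents of nameage.txt; the variable names are assumptions as well.

```ruby
# Made-up sample lines standing in for the contents of nameage.txt.
lines = [
  "John Doe: 34",
  "Tom Jones: 50",
  "Jane Doe: 32",
  "Ann Lee: 50"
]

# Push every age into an array.
ages = lines.map { |line| line.split(":").last.to_i }

# Either expression finds the highest age.
oldest_age = ages.max
oldest_by_sort = ages.sort[-1]

# Keep every line whose age equals the maximum, so ties are reported too.
oldest_lines = lines.select { |line| line.split(":").last.to_i == oldest_age }
puts oldest_lines
```

Selecting by the maximum afterwards is what lets this report more than one person when several share the highest age.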
Here's how I would approach it:
oldest_name = nil
oldest_age = 0
For each line in file do
split line at the colon and store the age inside age variable
split line at the colon and store the name inside name variable
if age is greater than oldest_age then
oldest_age = age
oldest_name = name
end
end
finally print the oldest_name and oldest_age
If you're into one-liners, try this:
$ cat nameage.txt
John Doe: 34
Tom Jones: 50
Jane Doe: 32
Citizen Kane: 29
$ irb
1.9.3-p551 :001 > IO.read("nameage.txt").split("\n").sort_by { |a| a.split(":")[1].to_i }.last
=> "Tom Jones: 50"
You can also try using a hash:
hash = {}
File.open('nameage.txt') do |f|
f.each_line do |line|
data = line.split(":")
# Store the age as an Integer so that max_by compares numbers, not strings.
hash[data.first] = data.last.to_i
end
puts hash.max_by{|k,v| v}.join(" : ")
end
File.open('nameage.txt') do |handle|
people = handle.each_line.map { |line| line.split(":").map(&:strip) }
oldest_age = people.map { |_, age| age.to_i }.max
people.select { |_, age| age.to_i == oldest_age }.each do |name, age|
puts "#{name}, #{age}"
end
end
You are going the right way. Now you just need to store the right things in the right places. I just merged your code and the code proposed by @oystersauce14.
oldest_name = nil
oldest_age = 0
File.open('nameage.txt') do |f|
f.each_line do |line|
data = line.split(":")
curr_name = data[0]
curr_age = data[1].strip.to_i
if (curr_age > oldest_age) then
oldest_name = curr_name
oldest_age = curr_age
end
end
end
puts "The oldest person is #{oldest_name} and he/she is #{oldest_age} years old."
Notice the use of String#strip when acquiring the age. According to the format of the file, the age has a space before the first digit. String#to_i ignores leading whitespace, so the strip is not strictly required, but it makes the intent explicit.
EDIT:
Since you may have more than one person with the maximum age in the list, you may do it in two passes:
oldest_age = 0
File.open('nameage.txt') do |f|
f.each_line do |line|
curr_age = line.split(":")[1].strip.to_i
if (curr_age > oldest_age) then
oldest_age = curr_age
end
end
end
oldest_people = Array.new
File.open('nameage.txt') do |f|
f.each_line do |line|
data = line.split(":")
curr_name = data[0]
curr_age = data[1].strip.to_i
oldest_people.push(curr_name) if (curr_age == oldest_age)
end
end
oldest_people.each { |person| puts "#{person} is #{oldest_age}" }
I believe that now this will give you what you need.

Nokogiri: Slop access a node named name

I'm trying to parse an XML document that looks like this:
<lesson>
<name>toto</name>
<version>42</version>
</lesson>
Using Nokogiri::Slop.
I can easily access the version through lesson.version, but I cannot access lesson.name, because name in this case refers to the name of the node itself (lesson).
Is there any way to access the child?
As a variant you could try this one:
doc.lesson.elements.select{|el| el.name == "name"}
Why? Just because of these benchmarks:
require 'nokogiri'
require 'benchmark'
str = '<lesson>
<name>toto</name>
<version>42</version>
</lesson>'
doc = Nokogiri::Slop(str)
n = 50000
Benchmark.bm do |x|
x.report("select") { n.times do; doc.lesson.elements.select{|el| el.name == "name"}; end }
x.report("search") { n.times do; doc.lesson.search('name'); end }
end
Which gives us the result:
#=> user system total real
#=> select 1.466000 0.047000 1.513000 ( 1.528153)
#=> search 2.637000 0.125000 2.762000 ( 2.777278)
You can use search and give the node a xpath or css selector:
doc.lesson.search('name').first
You can do a bit of a hack using metaprogramming.
require 'nokogiri'
doc = Nokogiri::Slop <<-HTML
<lesson>
<name>toto</name>
<version>42</version>
</lesson>
HTML
name_val = doc.lesson.instance_eval do
self.class.send :undef_method, :name
self.name
end.text
p name_val # => toto
p doc.lesson.version.text # => '42'
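Nokogiri::XML::Node#name is the method that returns a node's own name. The hack undefines that method on the node's class from inside #instance_eval, so the call to name falls through to Slop's dynamic child lookup. Be aware that undef_method is not reverted when the block ends, so later calls to name on nodes of that class are affected as well.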
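The mechanism behind this hack can be shown without Nokogiri. The sketch below uses a made-up FakeNode class as a stand-in for a Slop-decorated node: a real name method shadows the child called name until it is undefined, after which method_missing resolves the call. It is an illustration of the Ruby behaviour only, not of Nokogiri's internals.

```ruby
# Hypothetical stand-in for a Slop-decorated node; this is not Nokogiri code.
class FakeNode
  def initialize(tag, children = {})
    @tag = tag
    @children = children
  end

  # Plays the role of Node#name: returns the tag name of the node itself.
  def name
    @tag
  end

  # Plays the role of Slop's dynamic lookup: unknown methods resolve to children.
  def method_missing(meth, *args)
    key = meth.to_s
    @children.key?(key) ? @children[key] : super
  end

  def respond_to_missing?(meth, include_private = false)
    @children.key?(meth.to_s) || super
  end
end

lesson = FakeNode.new("lesson", "name" => "toto", "version" => "42")
before = lesson.name   # the real method wins and returns the tag name

FakeNode.send(:undef_method, :name)
after = lesson.name    # now handled by method_missing, returns the child

# The change applies to the whole class, including objects created later.
other = FakeNode.new("course", "name" => "ruby")
leaked = other.name

puts before, after, leaked
```

Because the change is class-wide, a lookup that leaves the class untouched, such as the search call shown earlier, is the less invasive option.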

Ruby: Reading contents of a xls file and getting each cells information

This is the link to an XLS file. I am trying to use the Spreadsheet gem to extract its contents. In particular, I want to collect all the column headers (Year, Gross National Product, etc.). The issue is that they are not all in the same row. For example, Gross National Income spans three rows. I also want to know how many row cells are merged to make the cell 'Year'.
I have started writing the program and have got this far:
require 'rubygems'
require 'open-uri'
require 'spreadsheet'
rows = Array.new
url = 'http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls'
doc = Spreadsheet.open (open(url))
sheet1 = doc.worksheet 0
sheet1.each do |row|
if row.is_a? Spreadsheet::Formula
# puts row.value
rows << row.value
else
# puts row
rows << row
end
# puts row.value
end
But, now I am stuck and really need some guideline to proceed. Any kind of help is well appreciated.
require 'rubygems'
require 'open-uri'
require 'spreadsheet'
rows = Array.new
temp_rows = Array.new
column_headers = Array.new
index = 0
url = 'http://www.stats.gov.cn/tjsj/ndsj/2012/html/C0201e.xls'
doc = Spreadsheet.open (open(url))
sheet1 = doc.worksheet 0
sheet1.each do |row|
rows << row.to_a
end
rows.each_with_index do |row,ind|
if row[0]=="Year"
index = ind
break
end
end
(index..7).each do |i|
# puts rows[i].inspect
if rows[i][0] =~ /[0-9]/
break
else
temp_rows << rows[i]
end
end
col_size = temp_rows[0].size
# puts temp_rows.inspect
col_size.times do |c|
temp_str = ""
temp_rows.each do |row|
temp_str +=' '+ row[c] unless row[c].nil?
end
# puts temp_str.inspect
column_headers << temp_str unless temp_str.nil?
end
puts 'Column Headers of this xls file are : '
# puts column_headers.inspect
column_headers.each do |col|
puts col.strip.inspect if col.length >1
end
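The column-wise header merge in the code above can be sketched on an in-memory array. The rows here are hypothetical, and the sketch assumes that a merged cell appears as a value in one row and nil in the others, and that all header rows have the same length (Array#transpose requires that).

```ruby
# Hypothetical header rows; nil marks cells that are empty in that row.
header_rows = [
  ["Year", "Gross",    "Gross"],
  [nil,    "National", "Domestic"],
  [nil,    "Income",   "Product"]
]

# transpose turns rows into columns; compact drops the nil cells
# before the remaining pieces are joined into one header per column.
column_headers = header_rows.transpose.map { |cells| cells.compact.join(" ") }

puts column_headers.inspect
```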

Data scraping with Nokogiri

I am able to scrape http://www.example.com/view-books/0/new-releases using Nokogiri but how do I scrape all the pages? This one has five pages, but without knowing the last page how do I proceed?
This is the program that I wrote:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
urls=Array['http://www.example.com/view-books/0/new-releases?layout=grid&_pop=flyout',
'http://www.example.com/view-books/1/bestsellers',
'http://www.example.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253'
]
@titles=Array.new
@prices=Array.new
@descriptions=Array.new
@page=Array.new
urls.each do |url|
doc=Nokogiri::HTML(open(url))
puts doc.at_css("title").text
doc.css('.fk-inf-scroll-item').each do |item|
@prices << item.at_css(".final-price").text
@titles << item.at_css(".fk-srch-title-text").text
@descriptions << item.at_css(".fk-item-specs-section").text
@page << item.at_css(".fk-inf-pageno").text rescue nil
end
(0..@prices.length - 1).each do |index|
puts "title: #{@titles[index]}"
puts "price: #{@prices[index]}"
puts "description: #{@descriptions[index]}"
# puts "pageno. : #{@page[index]}"
puts ""
end
end
CSV.open("result.csv", "wb") do |row|
row << ["title", "price", "description","pageno"]
(0..@prices.length - 1).each do |index|
row << [@titles[index], @prices[index], @descriptions[index],@page[index]]
end
end
As you can see I have hardcoded the URLs. How do you suggest that I scrape the entire books category? I was trying anemone but couldn't get it to work.
If you inspect what happens when you load more results, you will see that the page actually requests JSON, using an offset parameter.
So, you can get the five pages like this:
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=0
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=20
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=40
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=60
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=80
Basically, you keep incrementing inf-start and fetching results until a result set contains fewer than 20 items, which should be the last page.
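That loop can be sketched as follows. To keep the example runnable without network access, fetch_page is a hypothetical stand-in for the real HTTP request and JSON parsing; it slices a fake catalogue of 88 items. The page size of 20 comes from the step between the URLs above.

```ruby
page_size = 20

# Hypothetical stand-in for requesting one page of results at a given
# inf-start offset; a real version would fetch and parse the JSON response.
fake_catalogue = (1..88).map { |i| "book #{i}" }
fetch_page = ->(offset) { fake_catalogue[offset, page_size] || [] }

items = []
offsets = []
offset = 0
loop do
  page = fetch_page.call(offset)
  offsets << offset
  items.concat(page)
  # A page shorter than page_size means there is nothing left to load.
  break if page.size < page_size
  offset += page_size
end

puts "fetched #{items.size} items from offsets #{offsets.inspect}"
```

If the total happens to be an exact multiple of the page size, the loop makes one extra request that returns an empty page and then stops.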
Here's an untested sample of code to do what yours is, only written a bit more concisely:
require 'nokogiri'
require 'open-uri'
require 'csv'
urls = %w[
http://www.flipkart.com/view-books/0/new-releases?layout=grid&_pop=flyout
http://www.flipkart.com/view-books/1/bestsellers
http://www.flipkart.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253
]
CSV.open('result.csv', 'wb') do |row|
row << ['title', 'price', 'description', 'pageno']
urls.each do |url|
doc = Nokogiri::HTML(open(url))
puts doc.at_css('title').text
doc.css('.fk-inf-scroll-item').each do |item|
page = {
titles: item.at_css('.fk-srch-title-text').text,
prices: item.at_css('.final-price').text,
descriptions: item.at_css('.fk-item-specs-section').text,
pageno: item.at_css('.fk-inf-pageno')&.text, # nil when the element is missing
}
page.each do |k, v|
puts '%s: %s' % [k.to_s, v]
end
row << page.values
end
end
end
There are some useful pieces of data you can use to help you figure out how many records you need to retrieve:
var config = {container: "#search_results", page_size: 20, counterSelector: ".fk-item-count", totalResults: 88, "startParamName" : "inf-start", "startFrom": 20};
To access the values use something like:
doc.at('script[type="text/javascript+fk-onload"]').text =~ /page_size: (\d+).+totalResults: (\d+).+"startFrom": (\d+)/
page_size, total_results, start_from = $1, $2, $3
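Once those values are extracted, the offsets for every page follow directly. The sketch below runs the same pattern against the config line quoted above (held in a plain string instead of a parsed document) and builds the page URLs from the JSON endpoint pattern mentioned earlier; the variable names are assumptions.

```ruby
# The config line quoted above, held in a string for illustration.
script_text = 'var config = {container: "#search_results", page_size: 20, ' \
              'counterSelector: ".fk-item-count", totalResults: 88, ' \
              '"startParamName" : "inf-start", "startFrom": 20};'

match = script_text.match(/page_size: (\d+).+totalResults: (\d+).+"startFrom": (\d+)/)
# The captures are strings, so convert them before doing arithmetic.
page_size, total_results, start_from = match.captures.map(&:to_i)

# One offset per page, from 0 up to (but excluding) the total.
offsets = (0...total_results).step(page_size).to_a

base_url = 'http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start='
page_urls = offsets.map { |o| "#{base_url}#{o}" }
puts page_urls
```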
