Manipulating XML files in ruby with XmlSimple - ruby

I've got a complex XML file, and I want to extract a content of a specific tag from it.
I use a ruby script with XmlSimple gem. I retrieve an XML file with HTTP request, then strip all the unnecessary tags and pull out necessary info. That's the script itself:
data = XmlSimple.xml_in(response.body)
hash_1 = Hash[*data['results']]
def find_value(hash, value)
hash.each do |key, val|
if val[0].kind_of? Hash then
find_value(val[0], value)
else
if key.to_s.eql? value
puts val
end
end
end
end
hash_1['book'].each do |arg|
find_value(arg, "title")
puts("\n")
end
The problem is, that when I change replace puts val with return val, and then call find_value method with puts find_value (arg, "title"), i get the whole contents of hash_1[book] on the screen.
How to correct the find_value method?

A "complex XML file" and XmlSimple don't mix. Your task would be solved a lot easier with Nokogiri, and be faster as well:
require 'nokogiri'
doc = Nokogiri::XML(response.body)
puts doc.xpath('//book/title/text()')

Related

How do I scrape a website and output data to xml file with Nokogiri?

I've been trying to scrape data using Nokogiri and HTTParty and can scrape data off a website successfully and print it to the console but I can't work out how to output the data to an xml file in the repo.
Right now the code looks like this:
class Scraper
attr_accessor :parse_page
def initialize
doc = HTTParty.get("https://store.nike.com/gb/en_gb/pw/mens-nikeid-lifestyle-shoes/1k9Z7puZoneZoi3?ref=https%253A%252F%252Fwww.google.com%252F")
#parse_page ||= Nokogiri::HTML(doc)
end
def get_names
item_container.css(".product-display-name").css("p").children.map { |name| name.text }.compact
end
def get_prices
item_container.css(".product-price").css("span.local").children.map { |price| price.text }.compact
end
private
def item_container
parse_page.css(".grid-item-info")
end
scraper = Scraper.new
names = scraper.get_names
prices = scraper.get_prices
(0...prices.size).each do |index|
puts " - - - Index #{index + 1} - - -"
puts "Name: #{names[index]} | Price: #{prices[index]}"
end
end
I've tried changing the .each method to include a File.write() but all it ever does is write the last line of the output into the xml file. I would appreciate any insight as to how to parse the data correctly, I am new to scraping.
I've tried changing the .each method to include a File.write() but all it ever does is write the last line of the output into the xml file.
Is the File.write method inside the each loop? I guess what's happening here is You are overwriting the file on every iteration and that's why you are seeing only the last line.
Try putting the each loop inside the block of the File.open method like:
File.open(yourfile, 'w') do |file|
(0...prices.size).each do |index|
file.write("your text")
end
end
I also recommend reading about the Nokogiri::XML::Builder and then saving it's output to the file.

How to parse CSON to Ruby object?

I am trying to read CSON (CoffeeScript Object Notation) into Ruby.
I am looking for something similar to data = JSON.parse(file) that one would use for JSON files.
file = File.read(filename)
data = CSON.parse(file) # does not exist - would like to have
I looked into invoking CoffeeScript and JavaScript from Ruby, but it feels overly complicated and like reinventing the wheel. Also, code in the data file should not be executed.
How can I read CSON into Ruby objects in a simple way?
This is what I came up with. It is sufficient for the data I am processing. The main work is done with the YAML parser Psych (https://github.com/ruby/psych). Arrays, hashes, and some of the multi-line text require a special treatment.
module CSON
def load_file(fname)
load_string File.read fname
end
def remove_indent(data)
out = ""
data.each_line do |line|
out += line.sub /^\s\s/,""
end
out
end
def parse_array(data)
data.gsub! /\n/, ","
data.gsub! /([\[\{]),/, '\1'
data.gsub! /,([\]\}])/, '\1'
YAML.load data
end
def load_string(data)
hashed = {}
data.gsub! /^(\w+):\s+(\[.*?\])/mu do # find arrays
key = Regexp.last_match[1]
value = parse_array Regexp.last_match[2]
hashed[key] = value
""
end
data.gsub! /(\w+):\s+\'\'\'\s*\n(.*?)\'\'\'/mu do # find heredocs
hashed[Regexp.last_match[1]] = remove_indent Regexp.last_match[2]
""
end
hashed.merge YAML.load data
end
end
This solution is likely to fail when applied to more complicated .cson files. I would be happy to see if someone has a more elegant answer!

How do I test reading a file?

I'm writing a test for one of my classes which has the following constructor:
def initialize(filepath)
#transactions = []
File.open(filepath).each do |line|
next if $. == 1
elements = line.split(/\t/).map { |e| e.strip }
transaction = Transaction.new(elements[0], Integer(1))
#transactions << transaction
end
end
I'd like to test this by using a fake file, not a fixture. So I wrote the following spec:
it "should read a file and create transactions" do
filepath = "path/to/file"
mock_file = double(File)
expect(File).to receive(:open).with(filepath).and_return(mock_file)
expect(mock_file).to receive(:each).with(no_args()).and_yield("phrase\tvalue\n").and_yield("yo\t2\n")
filereader = FileReader.new(filepath)
filereader.transactions.should_not be_nil
end
Unfortunately this fails because I'm relying on $. to equal 1 and increment on every line and for some reason that doesn't happen during the test. How can I ensure that it does?
Global variables make code hard to test. You could use each_with_index:
File.open(filepath) do |file|
file.each_with_index do |line, index|
next if index == 0 # zero based
# ...
end
end
But it looks like you're parsing a CSV file with a header line. Therefore I'd use Ruby's CSV library:
require 'csv'
CSV.foreach(filepath, col_sep: "\t", headers: true, converters: :numeric) do |row|
#transactions << Transaction.new(row['phrase'], row['value'])
end
You can (and should) use IO#each_line together with Enumerable#each_with_index which will look like:
File.open(filepath).each_line.each_with_index do |line, i|
next if i == 1
# …
end
Or you can drop the first line, and work with others:
File.open(filepath).each_line.drop(1).each do |line|
# …
end
If you don't want to mess around with mocking File for each test you can try FakeFS which implements an in memory file system based on StringIO that will clean up automatically after your tests.
This way your test's don't need to change if your implementation changes.
require 'fakefs/spec_helpers'
describe "FileReader" do
include FakeFS::SpecHelpers
def stub_file file, content
FileUtils.mkdir_p File.dirname(file)
File.open( file, 'w' ){|f| f.write( content ); }
end
it "should read a file and create transactions" do
file_path = "path/to/file"
stub_file file_path, "phrase\tvalue\nyo\t2\n"
filereader = FileReader.new(file_path)
expect( filereader.transactions ).to_not be_nil
end
end
Be warned: this is an implementation of most of the file access in Ruby, passing it back onto the original method where possible. If you are doing anything advanced with files you may start running into bugs in the FakeFS implementation. I got stuck with some binary file byte read/write operations which weren't implemented in FakeFS quite how Ruby implemented them.

How do I traverse an inner node using SAX in Nokogiri?

I'm quite new to Nokogiri and Ruby and seeking a little help.
I am parsing a very large XML file using class MyDoc < Nokogiri::XML::SAX::Document. Now I want to traverse the inner part of a block.
Here's the format of my XML file:
<Content id="83087">
<Title></Title>
<PublisherEntity id="1067">eBooksLib</PublisherEntity>
<Publisher>eBooksLib</Publisher>
......
</Content>
I can already tell if the "Content" tag is found, now I want to know how to traverse inside of it. Here's my shortened code:
class MyDoc < Nokogiri::XML::SAX::Document
#check the start element. set flag for each element
def start_element name, attrs = []
if(name == 'Content')
#get the <Title>
#get the <PublisherEntity>
#get the Publisher
end
end
def cdata_block(string)
characters(string)
end
def characters(str)
puts str
end
end
Purists may disagree with me, but the way I've been doing it is to use Nokogiri to traverse the huge file, and then use XmlSimple to work with a smaller object in the file. Here's a snippet of my code:
require 'nokogiri'
require 'xmlsimple'
def isend(node)
return (node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT)
end
reader = Nokogiri::XML::Reader(File.open('database.xml', 'r'))
# traverse the file looking for tag "content"
reader.each do |node|
next if node.name != 'content' || isend(node)
# if we get here, then we found start of node 'content',
# so read it into an array and work with the array:
content = XmlSimple.xml_in(node.outer_xml())
title = content['title'][0]
# ...etc.
end
This works very well for me. Some may object to mixing SAX and non-SAX (nokogiri and XmlSimple) in the same code, but for my purposes, it gets the job done with minimal hassle.
It's trickier to do with SAX. I think the solution will need to look something like this:
class MyDoc < Nokogiri::XML::SAX::Document
def start_element name, attrs = []
#inside_content = true if name == 'Content'
#current_element = name
end
def end_element name
#inside_content = false if name == 'Content'
#current_element = nil
end
def characters str
puts "#{#current_element} - #{str}" if #inside_content && %w{Title PublisherEntity Publisher}.include?(#current_element)
end
end

How to search an XML when parsing it using SAX in nokogiri

I have a simple but huge xml file like below. I want to parse it using SAX and only print out text between the title tag.
<root>
<site>some site</site>
<title>good title</title>
</root>
I have the following code:
require 'rubygems'
require 'nokogiri'
include Nokogiri
class PostCallbacks < XML::SAX::Document
def start_element(element, attributes)
if element == 'title'
puts "found title"
end
end
def characters(text)
puts text
end
end
parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")
problem is that it prints text between all the tags. How can I just print text between the title tag?
You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:
class PostCallbacks < XML::SAX::Document
def initialize
#in_title = false
end
def start_element(element, attributes)
if element == 'title'
puts "found title"
#in_title = true
end
end
def end_element(element)
# Doesn't really matter what element we're closing unless there is nesting,
# then you'd want "#in_title = false if element == 'title'"
#in_title = false
end
def characters(text)
puts text if #in_title
end
end
The accepted answer above is correct, however it has a drawback that it will go through the whole XML file even if it finds <title> right at the beginning.
I did have similar needs and I ended up writing a saxy ruby gem that is aimed to be efficient in such situations. Under the hood it implements Nokogiri's SAX Api.
Here's how you'd use it:
require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first
It will stop right when it finds first occurrence of <title> tag.

Resources