Converting XML name-values into a simple hash - Ruby

I don't know what name this goes by and that's been complicating my search.
My data file OX.session.xml is in the (old?) form:
<?xml version="1.0" encoding="utf-8"?>
<CAppLogin xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://oxbranch.optionsxpress.com">
<SessionID>FE5E27A056944FBFBEF047F2B99E0BF6</SessionID>
<AccountNum>8228-5500</AccountNum>
<AccountID>967454</AccountID>
</CAppLogin>
What is that XML data format called exactly?
Anyway, all I want is to end up with one hash in my Ruby code like so:
CAppLogin = { :SessionID => "FE5E27A056944FBFBEF047F2B99E0BF6", :AccountNum => "8228-5500", etc. } # Doesn't have to be called CAppLogin as in the file, may be fixed
What might be the shortest, most built-in Ruby way to automate that hash read, in a way that lets me update the SessionID value and easily store it back into the file for later program runs?
I've played around with YAML and REXML but would rather not post my (bad) trial attempts yet.

There are a few libraries you can use in Ruby to do this.
The Ruby Toolbox has good coverage of a few of them:
https://www.ruby-toolbox.com/categories/xml_mapping
I use XmlSimple: just require the gem, then load in your XML file using xml_in:
require 'xmlsimple'
hash = XmlSimple.xml_in('session.xml')
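By default, xml_in wraps every element's content in an array, so you'd get {"SessionID"=>["FE5E27A056944FBFBEF047F2B99E0BF6"], ...}. Here is a sketch of a flat read plus the write-back half of the question using xml_out; 'ForceArray' and 'RootName' are standard XmlSimple options, but verify them against the gem's docs:
require 'xmlsimple'

# 'ForceArray' => false gives scalar values instead of one-element arrays.
hash = XmlSimple.xml_in('OX.session.xml', 'ForceArray' => false)
hash['SessionID'] # => "FE5E27A056944FBFBEF047F2B99E0BF6"

hash['SessionID'] = 'NEW-SESSION-ID' # placeholder value
# 'RootName' restores the original root element name on the way out.
File.write('OX.session.xml', XmlSimple.xml_out(hash, 'RootName' => 'CAppLogin'))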
If you're in a Rails environment, you can use Active Support. Note that Hash.from_xml takes the XML string itself, not a filename:
require 'active_support/core_ext/hash/conversions'
session = Hash.from_xml(File.read('session.xml'))

Using Nokogiri to parse the XML with namespaces:
require 'nokogiri'

dom = Nokogiri::XML(File.read('OX.session.xml'))
node = dom.xpath('ox:CAppLogin',
                 'ox' => 'http://oxbranch.optionsxpress.com').first
hash = node.element_children.each_with_object({}) do |e, h|
  h[e.name.to_sym] = e.content
end
puts hash.inspect
# {:SessionID=>"FE5E27A056944FBFBEF047F2B99E0BF6",
#  :AccountNum=>"8228-5500", :AccountID=>"967454"}
If you know that CAppLogin is the root element, you can simplify a bit:
require 'nokogiri'

dom = Nokogiri::XML(File.read('OX.session.xml'))
hash = dom.root.element_children.each_with_object({}) do |e, h|
  h[e.name.to_sym] = e.content
end
puts hash.inspect
# {:SessionID=>"FE5E27A056944FBFBEF047F2B99E0BF6",
#  :AccountNum=>"8228-5500", :AccountID=>"967454"}
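The question also asks about updating SessionID and storing it back for the next run. A minimal sketch of that round trip with Nokogiri (the replacement value is a placeholder):
require 'nokogiri'

dom = Nokogiri::XML(File.read('OX.session.xml'))
# Address the node through the file's default namespace.
session = dom.at_xpath('//ox:SessionID',
                       'ox' => 'http://oxbranch.optionsxpress.com')
session.content = 'NEW-SESSION-ID'
# Serialize the modified document back to the same file.
File.write('OX.session.xml', dom.to_xml)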

Related

Create a Ruby Hash out of an xml string with the 'ox' gem

I am currently trying to create a hash out of an XML document with the help of the ox gem.
Input XML:
<?xml version="1.0"?>
<expense>
<payee>starbucks</payee>
<amount>5.75</amount>
<date>2017-06-10</date>
</expense>
with the following Ruby/Ox code:
doc = Ox.parse(xml)
plist = doc.root.nodes
I get the following output:
=> [#<Ox::Element:0x00007f80d985a668 @value="payee", @attributes={}, @nodes=["starbucks"]>, #<Ox::Element:0x00007f80d9839198 @value="amount", @attributes={}, @nodes=["5.75"]>, #<Ox::Element:0x00007f80d9028788 @value="date", @attributes={}, @nodes=["2017-06-10"]>]
The output I want is a hash in the format:
{'payee' => 'Starbucks',
'amount' => 5.75,
'date' => '2017-06-10'}
to save in my SQLite database. How can I transform the objects array into a hash like the above?
Any help is highly appreciated.
The docs suggest you can use the following:
require 'ox'

xml = %{
<top name="sample">
  <middle name="second">
    <bottom name="third">Rock bottom</bottom>
  </middle>
</top>
}

puts Ox.load(xml, mode: :hash)
puts Ox.load(xml, mode: :hash_no_attrs)
# {:top=>[{:name=>"sample"}, {:middle=>[{:name=>"second"}, {:bottom=>[{:name=>"third"}, "Rock bottom"]}]}]}
# {:top=>{:middle=>{:bottom=>"Rock bottom"}}}
I'm not sure that's exactly what you're looking for though.
Otherwise, it really depends on the methods available on the Ox::Element instances in the array.
From the docs, it looks like there are two handy methods here: you can use [] and text.
Therefore, I'd use reduce to coerce the array into the hash format you're looking for, using something like the following:
ox_nodes = [#<Ox::Element:0x00007f80d985a668 @value="payee", @attributes={}, @nodes=["starbucks"]>, #<Ox::Element:0x00007f80d9839198 @value="amount", @attributes={}, @nodes=["5.75"]>, #<Ox::Element:0x00007f80d9028788 @value="date", @attributes={}, @nodes=["2017-06-10"]>]
ox_nodes.reduce({}) do |hash, node|
  hash[node['@value']] = node.text
  hash
end
I'm not sure whether node['@value'] will work, so you might need to experiment with that - otherwise perhaps node.instance_variable_get('@value') would do it.
node.text does the following, which sounds about right:
Returns the first String in the elements nodes array or nil if there is no String node.
N.B. I prefer to tidy the reduce block a little using tap, something like the following:
ox_nodes.reduce({}) do |hash, node|
  hash.tap { |h| h[node['@value']] = node.text }
end
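For what it's worth, here's a minimal end-to-end sketch of the same idea. It leans on Ox::Element#value returning the element name and #text returning its first String child, which matches the docs quoted above, but treat it as something to verify:
require 'ox'

xml = <<~XML
  <?xml version="1.0"?>
  <expense>
    <payee>starbucks</payee>
    <amount>5.75</amount>
    <date>2017-06-10</date>
  </expense>
XML

doc = Ox.parse(xml)
# Element#value is the tag name; Element#text is its first text node.
hash = doc.root.nodes.each_with_object({}) do |node, h|
  h[node.value] = node.text
end
puts hash.inspect
# => {"payee"=>"starbucks", "amount"=>"5.75", "date"=>"2017-06-10"}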
Hope that helps - let me know how you get on!
I found the answer to the question in my last comment myself:
def create_xml(expense)
  Ox.default_options = { :with_xml => false }
  doc = Ox::Document.new(:version => '1.0')
  expense.each do |key, value|
    e = Ox::Element.new(key)
    e << value
    doc << e
  end
  Ox.dump(doc)
end
The next question would be: how can I transform the value of the amount key from a string to a number before saving it to the database?
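One way to handle that follow-up with plain Ruby conversions - a sketch; whether Float, Integer, or BigDecimal is right depends on the database column:
require 'bigdecimal'

hash = { 'payee' => 'starbucks', 'amount' => '5.75', 'date' => '2017-06-10' }

# Kernel#Float raises on malformed input, unlike String#to_f, which silently returns 0.0.
hash['amount'] = Float(hash['amount']) # => 5.75
# For money values, BigDecimal avoids binary floating-point rounding:
# hash['amount'] = BigDecimal(hash['amount'])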

Nokogiri - Checking if the value of an xpath exists and is blank or not in Ruby

I have an XML file, and before I process it I need to make sure that a certain element exists and is not blank.
Here is the code I have:
CSV.open("#{csv_dir}/products.csv","w",{:force_quotes => true}) do |out|
out << headers
Dir.glob("#{xml_dir}/*.xml").each do |xml_file|
gdsn_doc = GDSNDoc.new(xml_file)
logger.info("Processing xml file #{xml_file}")
:x
#desc_exists = #gdsn_doc.xpath("//productData/description")
if !#desc_exists.empty?
row = []
headers.each do |col|
row << product[col]
end
out << row
end
end
end
The following code is not working to find the "description" element and to check whether it is blank or not:
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
Here is a sample of the XML file:
<productData>
  <description>Chocolate biscuits </description>
</productData>
This is how I have defined the class and Nokogiri:
class GDSNDoc
  def initialize(xml_file)
    @doc = File.open(xml_file) { |f| Nokogiri::XML(f) }
    @doc.remove_namespaces!
The code had to be moved up to an earlier stage, where Nokogiri was initialised. It doesn't get runtime errors, but it does let XML files with blank descriptions get through, and it shouldn't:
class GDSNDoc
  def initialize(xml_file)
    @doc = File.open(xml_file) { |f| Nokogiri::XML(f) }
    @doc.remove_namespaces!
    desc_exists = @doc.xpath("//productData/descriptions")
    if !desc_exists.empty?
You are creating your instance like this:
gdsn_doc = GDSNDoc.new(xml_file)
then using it like this:
@desc_exists = @gdsn_doc.xpath("//productData/description")
@gdsn_doc and gdsn_doc are two different things in Ruby - try just using the version without the @:
@desc_exists = gdsn_doc.xpath("//productData/description")
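Once the variable naming is sorted out, the "exists and is not blank" check itself could look like this - a sketch against the //productData/description path from the question, with xml_file standing in for the path from the loop:
require 'nokogiri'

doc = Nokogiri::XML(File.read(xml_file))
doc.remove_namespaces!

desc = doc.at_xpath('//productData/description')
# at_xpath returns nil when nothing matches; strip catches whitespace-only text.
if desc && !desc.text.strip.empty?
  # process the file
end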
The basic test is to use:
require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<productData>
  <description>Chocolate biscuits </description>
</productData>
EOT

# using XPath selectors...
doc.xpath('//productData/description').to_html # => "<description>Chocolate biscuits </description>"
doc.xpath('//description').to_html # => "<description>Chocolate biscuits </description>"
xpath works fine when the document is parsed correctly.
I get an error "undefined method 'xpath' for nil:NilClass (NoMethodError)".
Usually this means you didn't parse the document correctly. In your case it's because you're not using the right variable:
gdsn_doc = GDSNDoc.new(xml_file)
...
@desc_exists = @gdsn_doc.xpath("//productData/description")
Note that gdsn_doc is not the same as @gdsn_doc. The latter doesn't appear to have been initialized.
@doc = File.open(xml_file) { |f| Nokogiri::XML(f) }
While that should work, it's idiomatic to write it as:
@doc = Nokogiri::XML(File.read(xml_file))
File.open(...) do ... end is preferred if you're processing inside the block and want Ruby to automatically close the file. That isn't necessary when you're simply reading, then passing the content to something else for processing, hence the use of File.read(...), which slurps the file. (Slurping isn't necessarily good practice because it can have scalability problems, but for reasonably sized XML/HTML it's OK, because it's easier to use DOM-based parsing than SAX.)
If Nokogiri doesn't raise an exception, it was able to parse the content; however, that still doesn't mean the content was valid. It's a good idea to check
@doc.errors
to see whether Nokogiri/libXML had to do some fix-ups on the content just to be able to parse it. Fixing the markup can change the DOM from what you expect, making it impossible to find a tag based on your assumptions for the selector. You could use xmllint or one of the XML validators to check, but Nokogiri will still have to be happy.
Nokogiri includes a command-line version nokogiri that accepts a URL to the document you want to parse:
nokogiri http://example.com
It'll open IRB with the content loaded and ready for you to poke at it. It's very convenient when debugging and testing. It's also a decent way to make sure the content actually exists if you're dealing with HTML containing DHTML that loads parts of the page dynamically.

Read XML commented value in Ruby

I have XML with additional data in comments at the start of the XML file.
I want to read these details for some logging.
Is there any way to read XML commented values in Ruby?
The commented details are:
<!-- sessionId="QQQQQQQQ" --><!-- ProgramId ="EP445522" -->
Extracting values using Nokogiri and XPath:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML('<!-- sessionId="QQQQQQQQ" --><!-- ProgramId ="EP445522" -->')
comments = doc.xpath('//comment()')
tags = Hash[*comments.map { |c| c.content.match(/(\S+)\s*="(\w+)"/).captures }.flatten]
puts tags.inspect
# => {"sessionId"=>"QQQQQQQQ", "ProgramId"=>"EP445522"}
You could do something like this:
require 'rexml/document'
require 'rexml/xpath'
File::open('q.xml') do |fd|
xml = REXML::Document::new(fd)
REXML::XPath::each(xml.root, '//comment()') do |comment|
case comment.string
when /sessionId/
puts comment.string
when /ProgramId/
puts comment.string
end
end
end
Effectively what this does is loop through all comment nodes, then look for the strings you're interested in, such as sessionId. Once you've found the nodes you're looking for, you can process them using Ruby to extract the information you need.

How to get values in XML data using Nokogiri?

I'm using Nokogiri to parse XML data that I'm getting from the roar engine after I create a user. The XML looks like this:
<roar tick="135098427907">
<facebook>
<create_oauth status="ok">
<auth_token>14802206136746256007</auth_token>
<player_id>8957881063899628798</player_id>
</create_oauth>
</facebook>
</roar>
I'm totally new to Nokogiri. How do I get the value of status, the auth_token and player_id?
str = "<roar ......"
doc = Nokogiri.XML(str)
puts doc.xpath('//create_oauth/@status') # => ok
puts doc.xpath('//auth_token').text # => 148....
# player_id is the same as auth_token
And it's a good idea to learn some XPath, for example from the W3Schools tutorial.
How about this:
h1 = Nokogiri::XML.parse %{
  <roar tick="135098427907">
    <facebook>
      <create_oauth status="ok">
        <auth_token>14802206136746256007</auth_token>
        <player_id>8957881063899628798</player_id>
      </create_oauth>
    </facebook>
  </roar>
}

h1.xpath("//facebook/create_oauth/auth_token").text
h1.xpath("//facebook/create_oauth/player_id").text
You can use the Nori gem. It's an XML-to-hash converter, and in Ruby it's much more convenient to work with hashes:
require 'nori'

Nori.parser = :nokogiri
xml = "<roar tick='135098427907'>
  <facebook>
    <create_oauth status='ok'>
      <auth_token>14802206136746256007</auth_token>
      <player_id>8957881063899628798</player_id>
    </create_oauth>
  </facebook>
</roar>"

hash = Nori.parse(xml)
create_oauth = hash["roar"]["facebook"]["create_oauth"]
puts create_oauth["auth_token"] # 14802206136746256007
puts create_oauth["@status"] # ok
puts create_oauth["player_id"] # 8957881063899628798

How to crawl the right way?

I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.
There are exactly 43612 XML files that I want to crawl and store in a CSV file.
My script works if I crawl maybe 500 XML files, but anything larger takes too much time, and it freezes or something.
I have divided the code into pieces here so it is easy to read; the whole script/code is here: https://gist.github.com/1981074
I am using two libraries because I couldn't find a way to do this all in Nokogiri. I personally find REXML easier to use.
My question: how can I fix it so it won't take a week to crawl all of this? How do I make it run faster?
HERE IS MY SCRIPT:
Require the necessary libraries:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
Create a bunch of arrays to store the grabbed data:
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
Grab all the XML links from a specific site and store them in an array called @urls:
htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))
htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end
Loop through the @urls array and grab every element node that I want with XPath:
@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Fetch the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") { |e|
    m = e.text.to_s
    next if m.empty?
    @titleSv << m
  }
Then store them in a CSV file.
CSV.open("eduction_normal.csv", "wb") do |row|
(0..#ID.length - 1).each do |index|
row << [#ID[index], #titleSv[index], #titleEn[index], #identifier[index], #typeOfLevel[index], #typeOfResponsibleBody[index], #courseTyp[index], #credits[index], #degree[index], #preAcademic[index], #subjectCodeVhs[index], #descriptionSv[index], #lastedited[index], #expires[index]]
end
end
It's hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what's blocking you.
Libraries
You're using a lot of libraries here that probably aren't necessary.
You use both REXML and Nokogiri. They both do the same job. Except Nokogiri is much better at it (benchmark).
Use Hashes
Instead of storing data at an index in 15 parallel arrays, use one set of hashes.
For instance,
require 'set'

items = Set.new

doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end

items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
  ...
end
Collect the data, then write to file
Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.
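A sketch of that final pass, assuming each item hash carries the keys collected above (only :id and :title_sv are shown):
require 'csv'

CSV.open('eduction_normal.csv', 'wb') do |csv|
  items.each do |item|
    csv << [item[:id], item[:title_sv]] # append the remaining columns here
  end
end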
Be DRY
In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try instead to abstract out the common code.
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") { |e|
  m = e.text.to_s
  next if m.empty?
  @titleSv << m
}
Move what's common to a method
def get_value(xml, path)
  # Return the first non-empty text for the given XPath, or '' if none matches.
  xml.elements.each(path) do |e|
    text = e.text.to_s
    return text unless text.empty?
  end
  ''
end
And move anything constant to another hash
xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}
Now you can combine these techniques to make for much cleaner code:
item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
I hope this helps!
It won't work without your fixes, and I believe you should do as @Ian Bishop said and refactor your parsing code:
require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'

class Links < Pioneer::Base
  include REXML
  def locations
    ["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
  end

  def processing(req)
    doc = Nokogiri::HTML(req.response.response)
    doc.xpath('//a/@href').map do |links|
      links.content
    end
  end
end

class Crawler < Pioneer::Base
  include REXML
  def locations
    Links.new.start.flatten
  end

  def processing(req)
    xmldoc = REXML::Document.new(req.response.response)
    root = xmldoc.root
    id = root.attributes["id"]
    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
      title = e.text.to_s
      CSV.open("eduction_normal.csv", "a") do |f|
        f << [id, title ...]
      end
    end
  end
end

Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)
If you really want to speed it up, you're going to have to go concurrent.
One of the simplest ways is to install JRuby and then run your application with one small modification: install either the 'peach' or 'pmap' gems and then change your items.each to items.peach(n) (parallel each), where n is the number of threads. You'll need at least one thread per CPU core, but if you put I/O in your loop then you'll want more.
Also, use Nokogiri, it's much faster. Ask a separate Nokogiri question if you need to solve something specific with Nokogiri. I'm sure it can do what you need.
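If you'd rather stay on plain MRI, a small thread pool over a Queue captures most of the win, since the threads spend their time blocked on network I/O. A sketch, with the thread count and the body of the loop as placeholders:
require 'open-uri'
require 'thread'

queue = Queue.new
@urls.each { |url| queue << url } # the URL list collected earlier

workers = 8.times.map do
  Thread.new do
    loop do
      url = begin
        queue.pop(true) # non-blocking pop...
      rescue ThreadError
        break           # ...the queue is drained
      end
      body = open(url).read # blocking I/O; other threads keep running
      # ...parse body and collect results here...
    end
  end
end
workers.each(&:join)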
