Mapping XML to Ruby objects

I want to communicate between ruby and other applications in XML. I have defined a schema for this communication and I'm looking for the best way to do the transformation from data in Ruby to the XML and vice versa.
I have an XML document my_document.xml:
<myDocument>
<number>1</number>
<distance units="km">20</distance>
</myDocument>
This conforms to a schema, my_document_type.xsd (I shan't bother writing it out here).
Now I'd like to have the following class automatically generated from the XSD - is this reasonable or feasible?
# Represents a document created in the form of my_document_type.xsd
class MyDocument
attr_accessor :number, :distance, :distance_units
# Allows me to create this object from data in Ruby
def initialize(data)
@number = data['number']
@distance = data['distance']
@distance_units = data['distance_units']
end
# Takes an XML document of the correct form my_document.xml and populates internal systems
def self.from_xml(xml)
# Reads the XML and populates:
doc = ALibrary.load(xml)
@number = doc.xpath('/number').text()
@distance = doc.xpath('/distance').text()
@distance_units = doc.xpath('/distance').attr('units') # Or whatever
end
def to_xml
# Jiggery pokery
end
end
So that now I can do:
require 'awesomelibrary'
awesome_class = AwesomeLibrary.load_from_xsd('my_document_type.xsd')
doc = awesome_class.from_xml('my_document.xml')
p doc.distance # => 20
p doc.distance_units # => 'km'
And I can also do
doc = awesome_class.new('number' => 10, 'distance_units' => 'inches', 'distance' => '5')
p doc.to_xml
And get:
<myDocument>
<number>10</number>
<distance units="inches">5</distance>
</myDocument>
This sounds like fairly intense functionality to me, so I'm not expecting a full answer, but I'd appreciate any tips on libraries that already do this (I've tried RXSD, but I can't figure out how to get it to do this), or any thoughts on feasibility.
Thanks in advance!

Have you tried Nokogiri? The Slop decorator implements method_missing on the document in such a way that it essentially duplicates the functionality you're looking for.
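For instance, a minimal sketch against the my_document.xml above (Slop builds accessors dynamically via method_missing, so this is illustrative rather than schema-driven):
require 'nokogiri'

doc = Nokogiri::Slop(File.read('my_document.xml'))
doc.myDocument.number.content    # => "1"
doc.myDocument.distance.content  # => "20"
doc.myDocument.distance['units'] # => "km"
It won't generate classes from your XSD, but for reading documents of a known shape it gets you most of the way with no setup.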

Related

Naming different files. Handler? IO? Stream? Processor? Controller?

I'm having some trouble naming some files that I wrote. I don't really know the difference between a stream, I/O, a handler, a processor (is that a real concept?), and a controller. These are what my files look like in Ruby:
Starting from the rakefile:
desc "Calculate chocolate totals from a CSV of orders"
task :redeem_orders, [:orders_csv_path, :redemptions_csv_path] do |t, args|
args.with_defaults(:orders_csv_path => "./public/input/orders.csv", :redemptions_csv_path => "./public/output/redemptions.csv")
DataController.transfer(
input_path: args[:orders_csv_path],
output_path: args[:redemptions_csv_path],
formatter: ChocolateTotalsFormatter,
converter: ChocolateTotalsConverter
)
end
Then the controller (which in my mind delegates between different classes with the data obtained from the rakefile):
class DataController
def self.transfer(input_path:, output_path:, formatter:, converter:)
data_processor = DataProcessor.new(
input_path: input_path,
output_path: output_path,
formatter: formatter
)
export_data = converter.convert(data_processor.import)
data_processor.export(export_data)
end
end
The processor (which performs imports and exports according to the various classes that were passed in):
class DataProcessor
attr_reader :input_path,
:output_path,
:formatter,
:input_file_processor,
:output_file_processor
def initialize(input_path:, output_path:, formatter:)
@input_path = input_path
@output_path = output_path
@formatter = formatter
@input_file_processor = FileProcessorFactory.create(File.extname(input_path))
@output_file_processor = FileProcessorFactory.create(File.extname(output_path))
end
def import
formatter.format_input(input_file_processor.read(input_path: input_path))
end
def export(export_data)
output_file_processor.write(
output_path: output_path,
data: formatter.format_output(export_data)
)
end
end
The converter referenced in the controller looks like this (it converts data that was passed in to a different format; I'm more confident about this naming):
class ChocolateTotalsConverter
def self.convert(data)
data.map do |row|
ChocolateTotalsCalculator.new(row).calculate
end
end
end
And the FileProcessorFactory in the above code snippet creates a class like this one, which actually does the reading and writing to CSV:
require 'csv'
class CSVProcessor
include FileTypeProcessor
def self.read(input_path:, with_headers: true, return_headers: false)
CSV.read(input_path, headers: with_headers, return_headers: return_headers, converters: :numeric)
end
def self.write(output_path:, data:, write_headers: false)
CSV.open(output_path, "w", write_headers: write_headers) do |csv|
data.each do |row|
csv << row
end
end
end
end
I'm having trouble with naming. Does it look like I named things correctly? What should be named something like DataIO vs DataProcessor? What should a file named DataStream be doing? What about something that's a converter?
Ruby isn't a kingdom of nouns. Some programmers hear "everything is an object" and think "I am processing data, therefore I need a DataProcessor object!" But in Ruby, "everything is an object". There's only one novel "thing" in your example: a chocolate order (maybe redemptions, too). So you only need one custom class: ChocolateOrder. The other "things" we already have objects for: CSV represents the CSV file, Array (or Set or Hash) can represent the collection of chocolate orders.
Processing a CSV row into an order, converting an order into workable data, and totaling those data into a result aren't "things". They're actions! In Ruby, actions are methods, blocks, procs, lambdas, or top-level functions*. In your case I see a method like ChocolateOrder#payment for getting just the price to add up, then maybe some blocks for the rest of the processing.
In pseudocode I imagine something like this:
# input
orders = CSV.foreach(input_file).map do |row|
# get important stuff out of the row
Order.new(x, y, z)
end
# processing
redemptions = orders.map { |order| order.get_redemption }
# output
CSV.open(output_file, "wb") do |csv|
redemptions.each do |redemption|
# convert redemption to an array of strings
csv << redemption_ary
end
end
If your rows are really simple, I would even consider just setting headers: true on the CSV so it returns Hash-like rows, and leaving orders as that.
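As a quick sketch of that last point (the file and column names here are stand-ins):
require 'csv'

orders = CSV.read('orders.csv', headers: true, converters: :numeric)
orders.first['amount'] # hash-style access by column name
orders.first.to_h      # or a plain Hash if you prefer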
* Procs, lambdas, and top-level functions are objects too. But that's beside the point.
This seems like quite a 'Java' way of thinking; in Ruby I haven't seen patterns like this used very often. I'd say that you might only really need the DataProcessor class. CSVProcessor and ChocolateTotalsConverter have only class methods, which might be more idiomatic as instance methods of DataProcessor instead. I'd start there and see how you feel about it.
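A rough sketch of what that folding-in could look like (names carried over from the question; the CSV options are assumptions):
require 'csv'

class DataProcessor
  def initialize(input_path:, output_path:, formatter:)
    @input_path = input_path
    @output_path = output_path
    @formatter = formatter
  end

  # Was CSVProcessor.read
  def import
    @formatter.format_input(CSV.read(@input_path, headers: true, converters: :numeric))
  end

  # Was CSVProcessor.write
  def export(data)
    CSV.open(@output_path, "w") do |csv|
      @formatter.format_output(data).each { |row| csv << row }
    end
  end
end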

Parsing large XML file [duplicate]

So I'm attempting to parse a 400k+ line XML file using Nokogiri.
The XML file has this basic format:
<?xml version="1.0" encoding="windows-1252"?>
<JDBOR date="2013-09-01 04:12:31" version="1.0.20 [2012-12-14]" copyright="Orphanet (c) 2013">
<DisorderList count="6760">
*** Repeated Many Times ***
<Disorder id="17601">
<OrphaNumber>166024</OrphaNumber>
<Name lang="en">Multiple epiphyseal dysplasia, Al-Gazali type</Name>
<DisorderSignList count="18">
<DisorderSign>
<ClinicalSign id="2040">
<Name lang="en">Macrocephaly/macrocrania/megalocephaly/megacephaly</Name>
</ClinicalSign>
<SignFreq id="640">
<Name lang="en">Very frequent</Name>
</SignFreq>
</DisorderSign>
</Disorder>
*** Repeated Many Times ***
</DisorderList>
</JDBOR>
Here is the code I've created to parse and return each DisorderSign id and name into a database:
require 'nokogiri'
sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()
symptomsList = []
@doc.xpath("////DisorderSign").each do |x|
signId = x.at('ClinicalSign').attribute('id').text()
name = x.at('ClinicalSign').element_children().text()
symptomsList.push([signId, name])
end
symptomsList.each do |x|
Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
This works perfectly on the test files I've used, although they were much smaller, around 10,000 lines.
When I attempt to run this on the large XML file, it simply does not finish. I left it running overnight and it seemed to lock up. Is there any fundamental reason the code I've written would make this very memory intensive or inefficient? I realize I store every possible pair in a list, but that shouldn't be large enough to fill up memory.
Thank you for any help.
I see a few possible problems. First of all, this:
@doc = Nokogiri::XML(sympFile)
will slurp the whole XML file into memory as some sort of libxml2 data structure and that will probably be larger than the raw XML file.
Then you do things like this:
@doc.xpath(...).each
That may not be smart enough to produce an enumerator that just maintains a pointer to the internal form of the XML; it might be producing a copy of everything when it builds the NodeSet that xpath returns. That would give you another copy of most of the expanded-in-memory version of the XML. I'm not sure how much copying and array construction happens here, but there is room for a fair bit of memory and CPU overhead even if it doesn't duplicate everything.
Then you make your copy of what you're interested in:
symptomsList.push([signId, name])
and finally iterate over that array:
symptomsList.each do |x|
Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
I find that SAX parsers work better with large data sets but they are more cumbersome to work with. You could try creating your own SAX parser something like this:
class D < Nokogiri::XML::SAX::Document
def start_element(name, attrs = [ ])
if(name == 'DisorderSign')
@data = { }
elsif(name == 'ClinicalSign')
@key = :sign
@data[@key] = ''
elsif(name == 'SignFreq')
@key = :freq
@data[@key] = ''
elsif(name == 'Name')
@in_name = true
end
end
def characters(str)
@data[@key] += str if(@key && @in_name)
end
def end_element(name)
if(name == 'DisorderSign')
# Dump @data into the database here.
@data = nil
elsif(name == 'ClinicalSign')
@key = nil
elsif(name == 'SignFreq')
@key = nil
elsif(name == 'Name')
@in_name = false
end
end
end
The structure should be pretty clear: you watch for the opening of the elements you're interested in and do a bit of bookkeeping setup when they do, then cache the strings if you're inside an element you care about, and finally clean up and process the data as the elements close. Your database work would replace the
# Dump @data into the database here.
comment.
This structure makes it pretty easy to watch for the <Disorder id="17601"> elements so that you can keep track of how far you've gone. That way you can stop and restart the import with some small modifications to your script.
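For instance, start_element could also remember the current disorder (the @current_disorder_id variable is my addition, not part of the class above):
def start_element(name, attrs = [])
  @current_disorder_id = Hash[attrs]['id'] if name == 'Disorder'
  # ...the bookkeeping from above...
end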
A SAX parser is definitely what you want to be using. If you're anything like me and can't jive with the Nokogiri documentation, there is an awesome gem called Saxerator that makes this process really easy.
An example for what you are trying to do --
require 'saxerator'
parser = Saxerator.parser(File.new("Temp.xml"))
parser.for_tag(:DisorderSign).each do |sign|
signId = sign[:ClinicalSign][:id]
name = sign[:ClinicalSign][:name]
Symptom.create!(:name => name, :signid => signId)
end
You're likely running out of memory because symptomsList is getting too large. Why not perform the SQL within the xpath loop?
require 'nokogiri'
sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()
@doc.xpath("////DisorderSign").each do |x|
signId = x.at('ClinicalSign').attribute('id').text()
name = x.at('ClinicalSign').element_children().text()
Symptom.where(:name => name, :signid => signId.to_i).first_or_create
end
It's possible too that the file is just too large for the buffer to handle. In that case you could chop it up into smaller temp files and process them individually.
You can also use Nokogiri::XML::Reader. It's more memory intensive than the Nokogiri::XML::SAX parser, but you can keep the XML structure, e.g.
class NodeHandler < Struct.new(:node)
def process
# Node processing logic
# e.g.:
signId = node.at('ClinicalSign').attribute('id').text()
name = node.at('ClinicalSign').element_children().text()
end
end
Nokogiri::XML::Reader(File.open('./test/fixtures/example.xml')).each do |node|
if node.name == 'DisorderSign' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
NodeHandler.new(
Nokogiri::XML(node.outer_xml).at('./DisorderSign')
).process
end
end
Based on this blog

How to read data from a different file without using YAML or JSON

I'm experimenting with a Ruby script that will add data to a Neo4j database using the REST API. (Here's the tutorial with all the code, if interested.)
The script works if I include the hash data structure in the initialize method but I would like to move the data into a different file so I can make changes to it separately using a different script.
I'm relatively new to Ruby. If I copy the following data structure into a separate file, is there a simple way to read it from my existing script when I call @data? I've heard one could do something with YAML or JSON (I'm not familiar with how either works). What's the easiest way to read a file, and how could I go about coding that?
# I want to copy this data into a different file and read it with my script when I call @data.
{
nodes:[
{:label=>"Person", :title=>"title_here", :name=>"name_here"}
]
}
And here is part of my code, it should be enough for the purposes of this question.
class RGraph
def initialize
#url = 'http://localhost:7474/db/data/cypher'
# If I put this hash structure into a different file, how do I make @data read that file?
@data = {
nodes:[
{:label=>"Person", :title=>"title_here", :name=>"name_here"}
]
}
end
#more code here... not relevant to question
def create_nodes
# Scan file, find each node and create it in Neo4j
@data.each do |key,value|
if key == :nodes
@data[key].each do |node| # Cycle through each node
next unless node.has_key?(:label) # Make sure this node has a label
# We have sufficient data to create a node
label = node[:label]
attr = Hash.new
node.each do |k,v| # Hunt for additional attributes
next if k == :label # Don't create an attribute for "label"
attr[k] = v
end
create_node(label,attr)
end
end
end
end
end
rGraph = RGraph.new
rGraph.create_nodes
Given that the OP said in comments "I'm not against using either of those", let's do it in YAML (which preserves the Ruby object structure best). Save it:
@data = {
nodes:[
{:label=>"Person", :title=>"title_here", :name=>"name_here"}
]
}
require 'yaml'
File.write('config.yaml', YAML.dump(@data))
This will create config.yaml:
---
:nodes:
- :label: Person
:title: title_here
:name: name_here
If you read it in, you get exactly what you saved:
require 'yaml'
@data = YAML.load(File.read('config.yaml'))
puts @data.inspect
# => {:nodes=>[{:label=>"Person", :title=>"title_here", :name=>"name_here"}]}
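If you'd rather use JSON, the round trip is just as short; the only catch is that JSON stringifies symbol keys, so you need symbolize_names: true on the way back in to recover the same structure:
require 'json'
File.write('config.json', JSON.generate(@data))
@data = JSON.parse(File.read('config.json'), symbolize_names: true)
puts @data.inspect
# => {:nodes=>[{:label=>"Person", :title=>"title_here", :name=>"name_here"}]}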

Parse/read Large XML file with minimal memory footprint

I have a very large XML file (300 MB) of the following format:
<data>
<point>
<id><![CDATA[1371308]]></id>
<time><![CDATA[15:36]]></time>
</point>
<point>
<id><![CDATA[1371308]]></id>
<time><![CDATA[15:36]]></time>
</point>
<point>
<id><![CDATA[1371308]]></id>
<time><![CDATA[15:36]]></time>
</point>
</data>
Now I need to read it and iterate through the point nodes doing something for each. Currently I'm doing it with Nokogiri like this:
require 'nokogiri'
xmlfeed = Nokogiri::XML(open("large_file.xml"))
xmlfeed.xpath("./data/point").each do |item|
save_id(item.xpath("./id").text)
end
However, that's not very efficient, since it parses the whole thing in one go and hence creates a huge memory footprint (several GB).
Is there a way to do this in chunks instead? Might be called streaming if I'm not mistaken?
EDIT
The suggested answer using Nokogiri's SAX parser might be okay, but it gets very messy when there are several nodes within each point that I need to extract content from and process differently. Instead of returning a huge array of entries for later processing, I would much prefer to access one point at a time, process it, and then move on to the next, "forgetting" the previous one.
Given this little-known (but AWESOME) gist using Nokogiri's Reader interface, you should be able to do this:
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
inside_element 'point' do
for_element 'id' do puts "ID: #{inner_xml}" end
for_element 'time' do puts "Time: #{inner_xml}" end
end
end
Someone should make this a gem, perhaps me ;)
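If you'd rather not depend on the gist, the raw Reader interface gives you the same one-node-at-a-time behaviour; a rough sketch for the <point> format above (save_id is the method from the question):
require 'nokogiri'

reader = Nokogiri::XML::Reader(File.open("large_file.xml"))
reader.each do |node|
  next unless node.name == 'point' &&
              node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
  # Expand just this one <point> into a small document, then discard it
  point = Nokogiri::XML(node.outer_xml)
  save_id(point.at('id').text)
end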
Use Nokogiri::XML::SAX::Parser (event-driven parser) and Nokogiri::XML::SAX::Document:
require 'nokogiri'
class IDCollector < Nokogiri::XML::SAX::Document
attr_reader :ids
def initialize
@ids = []
@inside_id = false
end
def start_element(name, attrs)
# NOTE: This is simplified. You need some kind of stack manipulations
# (push in start_element / pop in end_element)
# to correctly pick `.//data/point/id` elements.
@inside_id = true if name == 'id'
end
def end_element(name)
@inside_id = false
end
def cdata_block string
@ids << string if @inside_id
end
end
collector = IDCollector.new
parser = Nokogiri::XML::SAX::Parser.new(collector)
parser.parse(File.open('large_file.xml'))
p collector.ids # => ["1371308", "1371308", "1371308"]
According to the documentation, Nokogiri::XML::SAX::Parser "is a SAX style parser that reads its input as it deems necessary."
You can also use Nokogiri::XML::SAX::PushParser if you need more control over the file input.
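A minimal PushParser sketch, feeding the file in chunks (reusing the IDCollector above; the chunk size is arbitrary):
collector = IDCollector.new
parser = Nokogiri::XML::SAX::PushParser.new(collector)
File.open('large_file.xml') do |f|
  parser << f.read(4096) until f.eof?
end
parser.finish
p collector.ids # => ["1371308", "1371308", "1371308"]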
If you use JRuby, you can take advantage of vtd-xml, which has the most efficient in-memory model, 3-5x more efficient than DOM:
http://vtd-xml.sf.net

List dynamic attributes in a Mongoid Model

I have gone over the documentation, and I can't find a specific way to go about this. I have already added some dynamic attributes to a model, and I would like to be able to iterate over all of them.
So, for a concrete example:
class Order
include Mongoid::Document
field :status, type: String, default: "pending"
end
And then I do the following:
Order.new(status: "processed", internal_id: "1111")
And later I want to come back and be able to get a list/array of all the dynamic attributes (in this case, "internal_id" is it).
I'm still digging, but I'd love to hear if anyone else has solved this already.
Just include something like this in your model:
module DynamicAttributeSupport
def self.included(base)
base.send :include, InstanceMethods
end
module InstanceMethods
def dynamic_attributes
attributes.keys - _protected_attributes[:default].to_a - fields.keys
end
def static_attributes
fields.keys - dynamic_attributes
end
end
end
and here is a spec to go with it:
require 'spec_helper'
describe "dynamic attributes" do
class DynamicAttributeModel
include Mongoid::Document
include DynamicAttributeSupport
field :defined_field, type: String
end
it "provides dynamic_attribute helper" do
d = DynamicAttributeModel.new(age: 45, defined_field: 'George')
d.dynamic_attributes.should == ['age']
end
it "has static attributes" do
d = DynamicAttributeModel.new(foo: 'bar')
d.static_attributes.should include('defined_field')
d.static_attributes.should_not include('foo')
end
it "allows creation with dynamic attributes" do
d = DynamicAttributeModel.create(age: 99, blood_type: 'A')
d = DynamicAttributeModel.find(d.id)
d.age.should == 99
d.blood_type.should == 'A'
d.dynamic_attributes.should == ['age', 'blood_type']
end
end
This will give you only the dynamic field names for a given record x:
dynamic_attribute_names = x.attributes.keys - x.fields.keys
if you use additional Mongoid features, you need to subtract the fields associated with those features:
e.g. for Mongoid::Versioning :
dynamic_attribute_names = (x.attributes.keys - x.fields.keys) - ['versions']
To get the key/value pairs for only the dynamic attributes:
Make sure to clone the result of attributes(); otherwise you modify x!
attr_hash = x.attributes.clone #### make sure to clone this, otherwise you modify x !!
dyn_attr_hash = attr_hash.delete_if{|k,v| ! dynamic_attribute_names.include?(k)}
or in one line:
x.attributes.clone.delete_if{|k,v| ! dynamic_attribute_names.include?(k)}
So, what I ended up doing is this. I'm not sure if it's the best way to go about it, but it seems to give me the results I'm looking for.
class Order
def dynamic_attributes
self.attributes.delete_if { |attribute|
self.fields.keys.member? attribute
}
end
end
The attributes method appears to return the actual attributes on the object, while fields appears to be a hash of the fields that were predefined. I couldn't exactly find that in the documentation, but I'm going with it for now unless someone else knows of a better way!
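With the Order model from the question, that should play out like this (status and _id are predefined fields, so they drop out):
order = Order.new(status: "processed", internal_id: "1111")
order.dynamic_attributes # => {"internal_id" => "1111"}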
Try .methods or .instance_variables.
Not sure I liked the clone approach, so I wrote one too. From this you could easily build a hash of the content as well. This merely outputs all the dynamic fields (flat structure):
(d.attributes.keys - d.fields.keys).each {|a| puts "#{a} = #{d[a]}"};
I wasn't able to get any of the above solutions to work (I didn't want to add slabs and slabs of code to each model, and, for some reason, the attributes method did not exist on my model instances), so I decided to write my own helper to do this for me. Please note that this method includes both dynamic and predefined fields.
helpers/mongoid_attribute_helper.rb:
module MongoidAttributeHelper
def self.included(base)
base.extend(AttributeMethods)
end
module AttributeMethods
def get_all_attributes
map = %Q{
function() {
for(var key in this)
{
emit(key, null);
}
}
}
reduce = %Q{
function(key, value) {
return null;
}
}
hashedResults = self.map_reduce(map, reduce).out(inline: true) # Returns an array of Hashes (e.g. {"_id"=>"EmailAddress", "value"=>nil})
# Build an array of just the "_id"s.
results = Array.new
hashedResults.each do |value|
results << value["_id"]
end
return results
end
end
end
models/user.rb:
class User
include Mongoid::Document
include MongoidAttributeHelper
...
end
Once I've added the aforementioned include (include MongoidAttributeHelper) to each model which I would like to use this method with, I can get a list of all fields using User.get_all_attributes.
Granted, this may not be the most efficient or elegant of methods, but it definitely works. :)
