How to convert partial XML to hash in Ruby - ruby

I have a string which has plain text and extra spaces and carriage returns then XML-like tags followed by XML tags:
String = "hi there.
<SET-TOPIC> INITIATE </SET-TOPIC>
<SETPROFILE>
<KEY>name</KEY>
<VALUE>Joe</VALUE>
</SETPROFILE>
<SETPROFILE>
<KEY>email</KEY>
<VALUE>Email#hi.com</VALUE>
</SETPROFILE>
<GET-RELATIONS>
<COLLECTION>goals</COLLECTION>
<VALUE>walk upstairs</VALUE>
</GET-RELATIONS>
So what do you think?
Is it true?
"
I want to parse this similar to use Nori or Nokogiri or Ox where they convert XML to a hash.
My goal is to be able to easily pull out the top level tags as keys and then know all the elements, something like:
Keys = ['SETPROFILE', 'SETPROFILE', 'SET-TOPIC', 'GET-OBJECT']
Values[0] = [{name => Joe}, {email => email#hi.com}]
Values[3] = [{collection => goals}, {value => walk up}]
I have seen several functions like that for true XML but all of mine are partial.
I started going down this line of thinking:
parsed = doc.search('*').each_with_object({}) do |n, h|
(h[n.name] ||= []) << n.text
end

I'd probably do something along these lines if I wanted the keys and values variables:
require 'nokogiri'
string = "hi there.
<SET-TOPIC> INITIATE </SET-TOPIC>
<SETPROFILE>
<KEY>name</KEY>
<VALUE>Joe</VALUE>
</SETPROFILE>
<SETPROFILE>
<KEY>email</KEY>
<VALUE>Email#hi.com</VALUE>
</SETPROFILE>
<GET-RELATIONS>
<COLLECTION>goals</COLLECTION>
<VALUE>walk upstairs</VALUE>
</GET-RELATIONS>
So what do you think?
Is it true?
"
doc = Nokogiri::XML('<root>' + string + '</root>', nil, nil, Nokogiri::XML::ParseOptions::NOBLANKS)
nodes = doc.root.children.reject { |n| n.is_a?(Nokogiri::XML::Text) }.map { |node|
[
node.name, node.children.map { |c|
[c.name, c.content]
}.to_h
]
}
nodes
# => [["SET-TOPIC", {"text"=>" INITIATE "}],
# ["SETPROFILE", {"KEY"=>"name", "VALUE"=>"Joe"}],
# ["SETPROFILE", {"KEY"=>"email", "VALUE"=>"Email#hi.com"}],
# ["GET-RELATIONS", {"COLLECTION"=>"goals", "VALUE"=>"walk upstairs"}]]
From nodes it's possible to grab the rest of the detail:
keys = nodes.map(&:first)
# => ["SET-TOPIC", "SETPROFILE", "SETPROFILE", "GET-RELATIONS"]
values = nodes.map(&:last)
# => [{"text"=>" INITIATE "},
# {"KEY"=>"name", "VALUE"=>"Joe"},
# {"KEY"=>"email", "VALUE"=>"Email#hi.com"},
# {"COLLECTION"=>"goals", "VALUE"=>"walk upstairs"}]
values[0] # => {"text"=>" INITIATE "}
If you'd rather, it's possible to pre-process the DOM and remove the top-level text:
doc.root.children.select { |n| n.is_a?(Nokogiri::XML::Text) }.map(&:remove)
doc.to_xml
# => "<root><SET-TOPIC> INITIATE </SET-TOPIC><SETPROFILE><KEY>name</KEY><VALUE>Joe</VALUE></SETPROFILE><SETPROFILE><KEY>email</KEY><VALUE>Email#hi.com</VALUE></SETPROFILE><GET-RELATIONS><COLLECTION>goals</COLLECTION><VALUE>walk upstairs</VALUE></GET-RELATIONS></root>\n"
That makes it easier to work with the XML.

Wrap the string content in a node and you can parse that with Nokogiri. The text outside the XML segment will be text node in the new node.
str = "hi there. .... Is it true?"
doc = Nokogiri::XML("<wrapper>#{str}</wrapper>")
segments = doc.xpath('/*/SETPROFILE')
Now you can use "Convert a Nokogiri document to a Ruby Hash" to convert the segments into a hash.
However, if the plain text contains some characters that needs to be escaped in the XML spec you'll need to find those and escape them yourself.

Related

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files then output them into CSV files to list out the proper rows and columns.
I was able to do so by processing one file at a time by defining the filename, and specifically output them into a defined output file name:
File.open('H:/output/xmloutput.csv','w')
I would like to write into multiple files and make their name the same as the XML filenames without hard coding it. I tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
<record:name>Bob Chuck</record:name>
<record:Address_Data>
<record:Street_Address>123 Main St</record:Street_Address>
<record:Postal_Code>12345</record:Postal_Code>
</record:Address_Data>
<record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'
files = ''
input_folder = "H:/input"
output_folder = "H:/output"
if input_folder[input_folder.length-1,1] == '/'
input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
output_folder = output_folder + '/'
end
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
if key == :Dataload_Request && !record.empty?
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/wd:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
File.open('H:/output/.*csv', 'w') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print 'output files ready!'
print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.
Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic making it more readable because it's lacking unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
doc.traverse do |node|
is icky because it sidesteps the power of Nokogiri being able to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower so use it as a very last resort.
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
Be careful with sub and gsub as they don't do what you're thinking they do. Both expect a regular expression, but will take a string which they do a escape on before beginning their scan.
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching and that gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
There's more but I think that's enough for now.
You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:
require 'nokogiri'
require 'csv'
def xml_to_csv(filename)
xml_str = File.read(filename)
xml_str.gsub!('record:','') # remove the record: namespace
doc = Nokogiri::XML xml_str
csv_filename = filename.gsub('.xml', '.csv')
CSV.open(csv_filename, 'wb' ) do |row|
row << ['name', 'street_address', 'postal_code', 'age']
row << [
doc.xpath('//name').text,
doc.xpath('//Street_Address').text,
doc.xpath('//Postal_Code').text,
doc.xpath('//Age').text,
]
end
end
# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }

Create a Ruby Hash out of an xml string with the 'ox' gem

I am currently trying to create a hash out of an xml documen, with the help of the ox gem
Input xml:
<?xml version="1.0"?>
<expense>
<payee>starbucks</payee>
<amount>5.75</amount>
<date>2017-06-10</date>
</expense>
with the following ruby/ox code:
doc = Ox.parse(xml)
plist = doc.root.nodes
I get the following output:
=> [#<Ox::Element:0x00007f80d985a668 #value="payee", #attributes={}, #nodes=["starbucks"]>, #<Ox::Element:0x00007f80d9839198 #value="amount", #attributes={}, #nodes=["5.75"]>, #<Ox::Element:0x00007f80d9028788 #value="date", #attributes={}, #nodes=["2017-06-10"]>]
The output I want is a hash in the format:
{'payee' => 'Starbucks',
'amount' => 5.75,
'date' => '2017-06-10'}
to save in my sqllite database. How can I transform the objects array into a hash like above.
Any help is highly appreciated.
The docs suggest you can use the following:
require 'ox'
xml = %{
<top name="sample">
<middle name="second">
<bottom name="third">Rock bottom</bottom>
</middle>
</top>
}
puts Ox.load(xml, mode: :hash)
puts Ox.load(xml, mode: :hash_no_attrs)
#{:top=>[{:name=>"sample"}, {:middle=>[{:name=>"second"}, {:bottom=>[{:name=>"third"}, "Rock bottom"]}]}]}
#{:top=>{:middle=>{:bottom=>"Rock bottom"}}}
I'm not sure that's exactly what you're looking for though.
Otherwise, it really depends on the methods available on the Ox::Element instances in the array.
From the docs, it looks like there are two handy methods here: you can use [] and text.
Therefore, I'd use reduce to coerce the array into the hash format you're looking for, using something like the following:
ox_nodes = [#<Ox::Element:0x00007f80d985a668 #value="payee", #attributes={}, #nodes=["starbucks"]>, #<Ox::Element:0x00007f80d9839198 #value="amount", #attributes={}, #nodes=["5.75"]>, #<Ox::Element:0x00007f80d9028788 #value="date", #attributes={}, #nodes=["2017-06-10"]>]
ox_nodes.reduce({}) do |hash, node|
hash[node['#value']] = node.text
hash
end
I'm not sure whether node['#value'] will work, so you might need to experiment with that - otherwise perhaps node.instance_variable_get('#value') would do it.
node.text does the following, which sounds about right:
Returns the first String in the elements nodes array or nil if there is no String node.
N.B. I prefer to tidy the reduce block a little using tap, something like the following:
ox_nodes.reduce({}) do |hash, node|
hash.tap { |h| h[node['#value']] = node.text }
end
Hope that helps - let me know how you get on!
I found the answer to the question in my last comment by myself:
def create_xml(expense)
Ox.default_options=({:with_xml => false})
doc = Ox::Document.new(:version => '1.0')
expense.each do |key, value|
e = Ox::Element.new(key)
e << value
doc << e
end
Ox.dump(doc)
end
The next question would be how can i transform the value of the amount key from a string to an integer befopre saving it to the database

Parsing a simple XML-like string with adjacent nodes

I'm using the engtagger gem to classify a sentence according to its parts of speech. The output I get is as follows:
puts text
# => "<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>"
I would have expected the gem to give me an array, but I guess I'll have to coerce this into an array myself.
What I'm eventually trying to get is a nested array something like this:
[["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]
However I'm not really sure how to approach this with Nokogiri (or another parser library). Here's what I've tried:
(byebug) doc = Nokogiri::XML(text)
#<Nokogiri::XML::Document:0x3fd400286e78 name="document" children=[#<Nokogiri::XML::Element:0x3fd400286900 name="nnp" children=[#<Nokogiri::XML::Text:0x3fd400286464 "My">]>]>
(byebug) Nokogiri.parse(text)
#<Nokogiri::XML::Document:0x3fd40028cd50 name="document" children=[#<Nokogiri::XML::Element:0x3fd40028c7d8 name="nnp" children=[#<Nokogiri::XML::Text:0x3fd40028c378 "My">]>]>
So I've tried two different Nokogiri methods, but both are only showing the first node. How can I get the rest of the adjacent nodes as well?
Alternatively, how can I get the engtagger call to return an array? In the docs, I didn't find an example of how to return an array with all tags, only arrays with one specific kind of tag.
The main thing is that well-formed XML should have a root node. You were receiving the very first node only because it was treated as the root (that said, the topmost) node and as it was closed, Nokogiri considered the XML document to be ended.
Nokogiri::XML("<root>#{text}</root>").
children.first. # get root node
children.map { |e| [e.text, e.name] }. # map to what’s needed
reject { |e| e.last == 'text' } # filter out garbage
That filtering might be more semantically correct:
Nokogiri::XML("<root>#{text}</root>").
children.first.
children.reject { |e| Nokogiri::XML::Text === e }.
map { |e| [e.text, e.name] }
The problem is you're parsing the fragment incorrectly:
require 'nokogiri'
doc = Nokogiri::XML.fragment("<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>")
doc.to_xml # => "<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>"
Nokogiri wants valid XML, but you can get it to accept partial XML chunks using fragment.
At that point you're able to do:
doc.children.each_with_object([]){ |n, a| a << [n.text, n.name] unless n.text? }
# => [["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]

How can I parse out elements in a "< tag >"?

I have a string:
string = <RECALL>first_name</RECALL>, I'd like to send you something. It'll help you learn more about both me and yourself. What is your email?"
I want to pull out the value "first_name" of the tag <RECALL>.
I used gem crack, but it doesn't behave as I expected:
parsed = Crack::XML.parse(string) =>
{"RECALL"=>"first_name, I'd like to send you something. It'll help you learn more about both me and yourself. What is your email?"}
Maybe XML parsing isn't the right way. What is the way so that I could get the following, desired behavior, instead?
{"RECALL"=>"first_name"}
Does not look like valid XML to me. I would just try to use an REGEXP here:
string = "<RECALL>first_name</RECALL>, I'd like to send you something..."
/<RECALL>(.*)<\/RECALL>/.match(string)[1]
#=> "first_name"
Here's two ways you could get the content of the tags:
string = "<RECALL>first_name</RECALL>"
firstname = string[/<RECALL>([^<]+)</, 1]
firstname # => "first_name"
Parsing strings containing tags gets tricky. It's doable for simple content, but once tags are nested or additional < or > show up, it gets a lot harder.
You can use a trick using an XML parser:
require 'nokogiri'
string = "foo <RECALL>first_name</RECALL> bar"
doc = Nokogiri::XML::DocumentFragment.parse(string)
doc.at('RECALL').text # => "first_name"
Note that I'm using Nokogiri::XML::DocumentFragment.parse. That tells Nokogiri to only expect a partial XML document and relaxes a lot of its normally strict XML rules. Then I can tell the parser to find the <RECALL> tag and grab its contained text.
...wondering if there's a way to extract it (I use Crack to extract it, but it only works if the <tag> is at the end of the string.
This pattern matches mid-string:
str = "foo <RECALL>first_name</RECALL> bar"
str[%r!<RECALL>([^<]+)</RECALL>!, 1] # => "first_name"
This pattern fails if the tag is not at the end of the string:
str[%r!<RECALL>([^<]+)</RECALL>\z!, 1] # => nil
And succeeds if it is at the end of the string:
str = "foo <RECALL>first_name</RECALL>"
str[%r!<RECALL>([^<]+)</RECALL>\z!, 1] # => "first_name"
This is one place where a regexp pattern makes it easier to do something than using a parser.
Using a parser:
require 'nokogiri'
Normally we don't care where a tag occurs in a DOM, but if it's important we can figure out where it is in relation to the other tags. It won't always be this straightforward though:
This returns nil if the tag isn't at the end of the string/DOM:
str = "foo <RECALL>first_name</RECALL> bar"
doc = Nokogiri::XML::DocumentFragment.parse(str)
recall_node = doc.at('RECALL')
recall_node == doc.children.last ? doc.at('RECALL').text : nil # => nil
This returns the text of the node because it is at the end of the DOM:
str = "foo <RECALL>first_name</RECALL>"
doc = Nokogiri::XML::DocumentFragment.parse(str)
recall_node = doc.at('RECALL')
recall_node == doc.children.last ? doc.at('RECALL').text : nil # => "first_name"
This works because every node in a document has an identifier and we can ask whether the node of interest matches the last node in the DOM:
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse("<node>first_name</node> text")
# => #(DocumentFragment:0x3ffc89c3d3e8 {
# name = "#document-fragment",
# children = [
# #(Element:0x3ffc89c3cf9c {
# name = "node",
# children = [ #(Text "first_name")]
# }),
# #(Text " text")]
# })
doc.at('node').object_id.to_s(16) # => "3ffc89c3cf9c"
doc.children.last.object_id.to_s(16) # => "3ffc89c3cec0"
doc = Nokogiri::XML::DocumentFragment.parse("<node>first_name</node>")
# => #(DocumentFragment:0x3ffc89c345cc {
# name = "#document-fragment",
# children = [
# #(Element:0x3ffc89c342c0 {
# name = "node",
# children = [ #(Text "first_name")]
# })]
# })
doc.at('node').object_id.to_s(16) # => "3ffc89c342c0"
doc.children.last.object_id.to_s(16) # => "3ffc89c342c0"

Nokogiri Builder: Replace RegEx match with XML

While using Nokogiri::XML::Builder I need to be able to generate a node that also replaces a regex match on the text with some other XML.
Currently I'm able to add additional XML inside the node. Here's an example;
def xml
Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.parent.add_child("Testing[1] footnote paragraph.")
add_footnotes(xml, 'An Entry')
}
}
end.to_xml
end
# further child nodes WILL be added to footnote
def add_footnotes(xml, text)
xml.footnote text
end
which produces;
<chapter>
<para>Testing[1] footnote paragraph.<footnote>An Entry</footnote></para>
</chapter>
But I need to be able to run a regex replace on the reference [1], replacing it with the <footnote> XML, producing output like the following;
<chapter>
<para>Testing<footnote>An Entry</footnote> footnote paragraph.</para>
</chapter>
I'm making the assumption here that the add_footnotes method would receive the reference match (e.g. as $1), which would be used to pull the appropriate footnote from a collection.
That method would also be adding additional child nodes, such as the following;
<footnote>
<para>Words.</para>
<para>More words.</para>
</footnote>
Can anyone help?
Here's a spin on your code that shows how to generate the output. You'll need to refit it to your own code....
require 'nokogiri'
FOOTNOTES = {
'1' => 'An Entry'
}
child_text = "Testing[1] footnote paragraph."
pre_footnote, footnote_id, post_footnote = /^(.+)\[(\d+)\](.+)/.match(child_text).captures
doc = Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.text(pre_footnote)
xml.footnote FOOTNOTES[footnote_id]
xml.text(post_footnote)
}
}
end
puts doc.to_xml
Which outputs:
<?xml version="1.0"?>
<chapter>
<para>Testing<footnote>An Entry</footnote> footnote paragraph.</para>
</chapter>
The trick is you have to grab the text preceding and following your target so you can insert those as text nodes. Then you can figure out what needs to be added. For clarity in your code you should preprocess all the text, get your variables figured out, then fall into the XML generator. Don't try to do any calculations inside the Builder block, instead just reference variables. Think of Builder like a view in an MVC-type application if that helps.
FOOTNOTES could actually be a database lookup, a hash or some other data container.
You should also look at the << method, which lets you inject XML source, so you could pre-build the footnote XML, then loop over an array containing the various footnotes and inject them. Often it's easier to pre-process, then use gsub to treat things like [1] as placeholders. See "gsub(pattern, hash) → new_str" in the documentation, along with this example:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
For instance:
require 'nokogiri'
text = 'this is[1] text and[2] text'
footnotes = {
'[1]' => 'some',
'[2]' => 'more'
}
footnotes.keys.each do |k|
v = footnotes[k]
footnotes[k] = "<footnote>#{ v }</footnote>"
end
replacement_xml = text.gsub(/\[\d+\]/, footnotes) # => "this is<footnote>some</footnote> text and<footnote>more</footnote> text"
doc = Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para { xml.<<(replacement_xml) }
}
end
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <chapter>
# >> <para>this is<footnote>some</footnote> text and<footnote>more</footnote> text</para>
# >> </chapter>
I can try as below :
require 'nokogiri'
def xml
Nokogiri::XML::Builder.new do |xml|
xml.chapter {
xml.para {
xml.parent.add_child("Testing[1] footnote paragraph.")
add_footnotes(xml, 'add text',"[1]")
}
}
end.to_xml
end
def add_footnotes(xml, text,ref)
string = xml.parent.child.content
xml.parent.child.content = ""
string.partition(ref).each do |txt|
next xml.text(txt) if txt != ref
xml.footnote text
end
end
puts xml
# >> <?xml version="1.0"?>
# >> <chapter>
# >> <para>Testing<footnote>add text</footnote> footnote paragraph.</para>
# >> </chapter>

Resources