Get nodes from xml string using regex - ruby

I have an XML string like the one below:
<Query>
<Code>USD</Code>
<Description>United States Dollars</Description>
<UpdateTime>2013-03-04 02:27:33</UpdateTime>
<toUSD>1</toUSD>
<USDto>1</USDto>
<toEUR>2</toEUR>
<EURto>3</EURto>
</Query>
All the text is on one line with no whitespace. I can't work out the right regex pattern. I want to get the nodes whose names begin with <to, for example <toEUR> and <toUSD>.
How should I write this pattern?

With nokogiri and the xpath function starts-with:
require 'nokogiri'
doc = Nokogiri::XML <<EOF
<Query>
<Code>USD</Code>
<Description>United States Dollars</Description>
<UpdateTime>2013-03-04 02:27:33</UpdateTime>
<toUSD>1</toUSD>
<USDto>1</USDto>
<toEUR>2</toEUR>
<EURto>3</EURto>
</Query>
EOF
doc.search('//*[starts-with(name(),"to")]').map &:to_s
#=> ["<toUSD>1</toUSD>", "<toEUR>2</toEUR>"]

Although the general consensus is that parsing xml etc with regex is not the way to go, something like this should do the trick:
<\s*(to[^>\s]+)[^>]*>([^<]+)<\s*/\s*\1\s*>
In Ruby format:
/<\s*(to[^>\s]+)[^>]*>([^<]+)<\s*\/\s*\1\s*>/
This matches <toWhatever>value</toWhatever>. Back-reference group 1 returns the name (toWhatever) and back-reference group 2 returns the value.
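As a rough sketch of how that pattern could be applied in plain Ruby, String#scan returns one [name, value] pair per match. The variable names (xml, pattern, pairs, rates) are only illustrative, and the input is the question's sample collapsed onto one line.

```ruby
# Sample input from the question, collapsed onto a single line.
xml = '<Query><Code>USD</Code><Description>United States Dollars</Description>' \
      '<UpdateTime>2013-03-04 02:27:33</UpdateTime><toUSD>1</toUSD>' \
      '<USDto>1</USDto><toEUR>2</toEUR><EURto>3</EURto></Query>'

# Group 1 captures the tag name, group 2 the text between the tags.
pattern = /<\s*(to[^>\s]+)[^>]*>([^<]+)<\s*\/\s*\1\s*>/

# With capture groups, scan returns an array of [name, value] pairs.
pairs = xml.scan(pattern)

# Turn the pairs into a hash keyed by tag name.
rates = pairs.to_h

p pairs # => [["toUSD", "1"], ["toEUR", "2"]]
p rates # => {"toUSD"=>"1", "toEUR"=>"2"}
```

<USDto> and <EURto> are skipped because the pattern requires "to" immediately after the opening angle bracket.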

Related

How to search two paths but get the results in order using Nokogiri

I am trying to search for elements with prefix w and also t or br using Nokogiri.
For example if this is the core of the doc returned from parsing:
<w:t></w:t><w:br></w:br><w:t></w:t>
This search
doc.search('.//w:t','.//w:br')
Results in:
['<w:t></w:t>','<w:t></w:t>','<w:br></w:br>']
Instead I want (the elements are in the order of the original doc):
['<w:t></w:t>','<w:br></w:br>','<w:t></w:t>']
Using CSS selectors you can do this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<t></t><br></br><t></t>
</xml>
EOT
doc.search('t, br')
# => [#<Nokogiri::XML::Element:0x3c name="t">, #<Nokogiri::XML::Element:0x50 name="br">, #<Nokogiri::XML::Element:0x64 name="t">]
doc.search('t, br').map(&:to_html)
# => ["<t></t>", "<br>", "<t></t>"]
CSS selectors are recommended by Nokogiri's authors because they're generally easier and less noisy.
Using XPath, this'd work:
doc.search('//t | //br')
# => [#<Nokogiri::XML::Element:0x3c name="t">, #<Nokogiri::XML::Element:0x50 name="br">, #<Nokogiri::XML::Element:0x64 name="t">]
doc.search('//t | //br').map(&:to_html)
# => ["<t></t>", "<br>", "<t></t>"]
However, your XML has namespaces, and you didn't show us the appropriate namespace declaration so that's left for you to figure out.
See Nokogiri's Namespaces documentation for more information.
Thanks to the Tin Man's response, the answer I was looking for is this:
doc.search('.//w:t | .//w:br')

xpath multiple nodes query with custom strings

I have a working multiple node xpath query and I want to add some custom strings between the results.
<FooBar>
<Foo>
<Fooid>A</Fooid>
<Booid>222</Booid>
<Wooid>Z</Wooid>
</Foo>
<Foo>
<Fooid>B</Fooid>
<Booid>333</Booid>
<Wooid>Y</Wooid>
</Foo>
<Foo>
<Fooid>C</Fooid>
<Booid>444</Booid>
<Wooid>X</Wooid>
</Foo>
</FooBar>
I have tried different combinations of string-join and concat, but the result was always wrong or ended in a syntax error. I am using XPath 2.0.
//Foo/Fooid | //Foo/Booid | Foo/Wooid
The above xpath results in:
A
222
Z
My preferred result would be:
(A)
{222}
[Z]
What is the correct usage of string-join to get the brackets around the three ids?
After doing some research, and with the help of your comments, I was able to achieve the desired result with this line:
//Foo/concat('(', Fooid, ')'), //Foo/concat('{', Booid, '}'),Foo/concat('[', Wooid, ']')
The '|' was replaced by a comma.
To concatenate these characters, use their HTML entities instead.
concat('&lpar;', //Fooid, '&rpar;')
for parentheses use
&lpar;
&rpar;
for brackets
&lbrack;
&rbrack;
for braces
&lbrace;
&rbrace;
See full character entity sets here

How can I extract the node names for fragmented XML document using Ruby?

I have an XML-like document which is pre-processed by a system outside my control. The format of the document is like this:
<template>
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
</template>
However, I only get as a text string what is between the <template> tags.
I would like to be able to extract the tags without specifying them ahead of time when parsing. I can do this with the Crack gem, but only if there is a single tag and it is at the end of the string.
With Crack, I can put a string like
string = "<SETPROFILE><NAME>email</NAME><VALUE>go@go.com</VALUE></SETPROFILE>"
and my output from Crack is:
{"SETPROFILE"=>{"NAME"=>"email", "VALUE"=>"go@go.com"}}
Then I can use a case statement for the possible values I care about.
Given that I need to have multiple <tags> in the string and they cannot be at the end of the string, how can I parse out the node names and the values easily, similar to what I do with crack?
These tags also need to be removed. I would like to continue to use the excellent suggestion from @TinMan.
It works perfectly once I know the name of the tag. The number of tags will be finite. I send the tag to the appropriate method once I know it, but it needs to get parsed out easily first.
Using Nokogiri, you can treat the string as a DocumentFragment, then find the embedded nodes:
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse(<<EOT)
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
EOT
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n] = n.text
}
nodes # => {#<Nokogiri::XML::Element:0x3ff96083b744 name="RECALL" children=[#<Nokogiri::XML::Text:0x3ff96083a09c "first_name">]>=>"first_name", #<Nokogiri::XML::Element:0x3ff96083b5c8 name="SETPROFILE" children=[#<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>, #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>=>"", #<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">=>""}
Or, more legibly:
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n.name] = n.text
}
nodes # => {"RECALL"=>"first_name", "SETPROFILE"=>"email", "NAME"=>"email", "VALUE"=>"", "star"=>""}
Getting the content of a particular tag is easy then:
nodes['RECALL'] # => "first_name"
Iterating over all the tags is also easy:
nodes.keys.each do |k|
...
end
You can even replace a tag and its content with text:
doc.at('RECALL').replace('Fred')
doc.to_xml # => "Hello, there Fred. Thanks for giving me your email. \n<SETPROFILE>\n <NAME>email</NAME>\n <VALUE>\n <star/>\n </VALUE>\n</SETPROFILE>. I have just sent you something.\n"
How to replace the nested tags is left to you as an exercise.

Get a specific tag in a node?

I'm using Ruby, XPath and Nokogiri and trying to retrieve d1 from the following XML:
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
This is my code in a loop:
rs = doc.xpath("//a/b1/c/d1").inner_text
puts rs
It returns nothing (No error).
I want to get the text in <d1>.
You don't ask for the text content in your xpath query:
rs = doc.xpath('//a/b1/c/d1/text()')
You're misusing XPath:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<a>
<b1>
<c>
<d1>01/11/2001</d1>
<d2>02/02/2004</d2>
</c>
</b1>
</a>
EOT
doc.at('/a/b1/c/d1').text # => "01/11/2001"
doc.at('//d1').text # => "01/11/2001"
// in XPath-ese means start at the top and look anywhere in your document. Instead, if you're supplying an explicit/absolute selector, start at the top of the document and drill down using '/a/b1/c/d1'. Or, do the simple thing and let the parser search through the document for that particular node using //d1. You can do that if you know there's a single instance of that node.
In my code above, I used at instead of xpath. at returns the first matching node, which is similar to using xpath('//d1').first. xpath returns a NodeSet, which is like an array of nodes, whereas at returns a Node only. Using inner_text on a NodeSet is likely to not give you the results you want, which would be the text of a particular node, so be careful there.
doc.xpath('/a/b1/c/d1/text()').class # => Nokogiri::XML::NodeSet
doc.xpath('//c').inner_text # => "\n 01/11/2001\n 02/02/2004\n "
doc.xpath('/a/b1/c/d1').first.text # => "01/11/2001"
Look at the following lines. Instead of using XPath selectors, I used CSS, which tends to be more readable. Nokogiri supports both.
doc.at('d1').text # => "01/11/2001"
doc.at('a b1 c d1').text # => "01/11/2001"
Also, notice the type of data returned from these two lines:
doc.at('/a/b1/c/d1/text()').class # => Nokogiri::XML::Text
doc.at('/a/b1/c/d1').text.class # => String
While it might seem good/smart to tell the parser to locate the text() node inside <d1>, what will be returned isn't text, and will need to be accessed further to make it usable, so consider forgoing the use of text() unless you know exactly why you need it:
doc.at('/a/b1/c/d1/text()').text # => "01/11/2001"
Finally, Nokogiri has many methods used for locating nodes. As I said above, xpath returns a NodeSet and at returns a Node. xpath is really an XPath-specific version of Nokogiri's search method. search, css and xpath all return NodeSets. at, at_css and at_xpath all return Nodes. The CSS and XPath variants are useful when you have an ambiguous selector that you need to be used as CSS or XPath specifically. Most of the time Nokogiri can figure whether it's CSS or XPath on its own and will do the right thing, so it's OK to use the generic search and at for the majority of your coding. Use the specific versions when you have to specify one or the other.

XPath find text in any text node

I am trying to find a certain text in any text node in a document, so far my statement looks like this:
doc.xpath("//text() = 'Alliance Consulting'") do |node|
...
end
This obviously does not work, can anyone suggest a better alternative?
The expression //text() = 'Alliance Consulting' evaluates to a boolean.
In case of this test sample:
<r>
<t>Alliance Consulting</t>
<s>
<p>Test string
<f>Alliance Consulting</f>
</p>
</s>
<z>
Alliance Consulting
<y>
Other string
</y>
</z>
</r>
It will, of course, return true.
The expression you need should evaluate to a node-set, so use:
//text()[. = 'Alliance Consulting']
For example, the expression:
count(//text()[normalize-space() = 'Alliance Consulting'])
against the above document will return 3.
To select text nodes which contain 'Alliance Consulting' in the whole string value (e.g. 'Alliance Consulting provides great services') use:
//text()[contains(.,'Alliance Consulting')]
Note that adjacent text nodes should be merged into a single node once the parser has processed the document.
