Parsing XML document missing enclosing parent entity - ruby

I want to process an XML document that lacks an overarching enclosing entity. (Yes, that's the file I'm given. No, I didn't create it.) For example:
<DeviceInfo>
<Greeting>Crunchy bacon!</Greeting>
</DeviceInfo>
<InstantaneousDemand>
<TimeStamp>0x1c722845</TimeStamp>
</InstantaneousDemand>
<InstantaneousDemand>
<TimeStamp>0x1c72284a</TimeStamp>
</InstantaneousDemand>
When I parse the file using Nokogiri's XML method, it (predictably) only reads the first entity:
>> doc = Nokogiri::XML(File.open("x.xml"))
>> doc.children.count
=> 1
doc.text
=> "\n Crunchy bacon!\n"
I could read the file as a string and wrap a fake enclosing entity around the whole thing, but that seems heavy handed. Is there a better way to get Nokogiri to read in all the entities?

You might create a DocumentFragment rather than Document (especially taking into account that your content is actually a document fragment):
▶ doc = Nokogiri::XML::DocumentFragment.parse File.read("x.xml")
#⇒ #<Nokogiri::XML::DocumentFragment:0x14efa38 name="#document-fragment"
# ...
# #<Nokogiri::XML::Element:0x14ef68c name="InstantaneousDemand"
# ...
▶ doc.children.count
#⇒ 6
Hope it helps.

Related

Trouble reading and binding yml data in ruby

I have a yaml file that includes the following:
:common
:substitue
:foo: fee
I read this data like:
data = YAML.load(erb_data[File.basename(__FILE__, '.*')].result(binding))
common = data[:common]
def substitute_if_needed(original_value)
mapping = common.dig(:substitue, original_value)
if mapping.nil? ? original_value : mapping
end
Unfortunately, this doesn't do the substitution that I want. I want to call substitute_if_needed('foo') and get 'fee' back. I also want to call substitute_if_needed('bar') and get 'bar' back.
How can I do this?
There are several problems in your code:
YAML example looks broken. The proper one should looks like:
common:
substitute:
foo: fee
You're trying to fetch common key in common = data[:common] using a symbol as a key, but it should be a string (data["common"]). Also, I'd say it's a bad idea to spilt fetching logic into two pieces - first extract "common" outside of substitute_when_needed and then dig into it inside.
if statement is broken. It should be either proper if or proper ternary operator.
Fixing all this gives us something like (I've just replaced a file with StringIO for convenience - to make the snippet executable as is):
yaml = StringIO.new(<<~DATA)
common:
substitute:
foo: fee
DATA
def substitute_if_needed(data, original_value)
mapping = data.dig("common", "substitute", original_value)
mapping.nil? ? original_value : mapping
end
data = YAML.load(yaml)
substitute_if_needed(data, "foo") # => "fee"
substitute_if_needed(data, "bar") # => "bar"

How to add a namespace to existing xml file

I want to open this file and get all elements that start with us-gaap.
ftp://ftp.sec.gov/edgar/data/916789/0001558370-15-001143.txt
To get elements I tried like this:
str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
doc = Nokogiri::XML(File.read(str))
doc.xpath('//us-gaap:*')
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //us-gaap:*
from /Users/ironsand/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/searchable.rb:165:in `evaluate'
doc.namespaces returns {}, so I think I have to add namespace us-gaap.
There are some questions about "adding namespace with Nokogiri", but it looks like about how to create a new XML document, not how to add a namespace to existing documents.
How can I add a namespace to existing document?
I know I can remove the namespace by Nokogiri::XML::Document#remove_namespaces!, but I don't want to use it because it removes also necesarry information.
You have asked an XY Problem. You think that the problem is that you need to add a missing namespace; the real problem is that the file you're trying to parse is not valid XML.
require 'nokogiri'
doc = Nokogiri.XML( IO.read('0001558370-15-001143.txt') )
doc.errors.length
#=> 5716
For example, the <ACCEPTANCE-DATETIME> 'element' opened on line 3 is never closed, and on line 16 there is a raw ampersand in the text:
STANDARD INDUSTRIAL CLASSIFICATION: ELECTRIC HOUSEWARES & FANS [3634]
which ought to be escaped as an entity.
However, the document has valid XML fragments within it! In particular, there is one XML document that defines xmlns:us-gaap namespace, from lines 27243-49312. Let's extract just that, using only the knowledge that the root element defines the namespace we want, and the assumptions that no element with the same name is nested within the document, and that the root element does not have an unescaped > character in any attribute. (These assumptions are valid for this file, but may not be valid for every XML file.)
txt = IO.read('0001558370-15-001143.txt')
gaap_finder = %r{(<(\w+) [^>]+xmlns:us-gaap=.+?</\2>)}m
txt.scan(gaap_finder) do |xml,_|
doc = Nokogiri.XML( xml )
gaaps = doc.xpath('//us-gaap:*')
p gaaps.length
#=> 569
end
The code above handles the case where there may be more than one XML document in the txt file, though in this case there is only one.
Decoded, the gaap_finder regex says this:
%r{...}m — this is a regular expression (that allows slashes in it, unescaped) with "multiline mode", where a period will match newline characters
(...) — capture everything we find
< — start with a literal "less-than" symbol
(\w+) — find one or more word characters (the tag name), and save them
— the word characters must be followed by a space (important to avoid capturing the <xsd:xbrl ...> element in this file)
[^>]+ — followed by one or more characters that is NOT a "greater-than" symbol (to ensure that we stay in the same element that we started in)
xmlns:us-gaap\s*= — followed by this literal namespace declaration (which may have whitespace separating it from the equals sign)
.+? — followed by anything (as little as possible)...
</\2> — ...up until you see a closing tag with the same name as what we captured for the name of the starting tag
Because of the way scan works when the regex has capturing groups, each result is a two-element array, where the first element is the entire captured XML and the second element is the name of the tag that we captured (which we "discard" by assigning it to the _ variable).
If you want to be less magic about your capturing, the text file format appears to always wrap each XML document in <XBRL>...</XBRL>. So, you could do this to process every XML file (there are seven, five of which do not happen to have any us-gaap namespaces):
txt = IO.read('0001558370-15-001143.txt')
xbrls = %r{(?<=<XBRL>).+?(?=</XBRL>)}m # find text inside <XBRL>…</XBRL>
txt.scan(xbrls) do |xml|
doc = Nokogiri.XML( xml )
if doc.namespaces["xmlns:us-gaap"]
gaaps = doc.xpath('//us-gaap:*')
p gaaps.length
end
end
#=> 569
#=> 0 (for the XML Schema document that defines the namespace)
I couldn't figure out how to update an existing doc with a new namespace, but since Nokogiri will recognize namespaces on the root element, and those namespaces are, syntactically, just attributes, you can update the document with a new namespace declaration, serialize the doc to a string, and re-parse it:
str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
doc_without_ns = Nokogiri::XML(str)
doc_without_ns.root['xmlns:us-gaap'] = 'http://your/actual/ns/here'
doc = Nokogiri::XML(doc_without_ns.to_xml)
doc.xpath("//us-gaap:*")
# Returns [#<Nokogiri::XML::Element:0x3ff375583f9c name="foo" namespace=#<Nokogiri::XML::Namespace:0x3ff375583f24 prefix="us-gaap" href="http://your/actual/ns/here"> children=[#<Nokogiri::XML::Text:0x3ff375583768 "foo">]>]

Disable HTML within XML escaping with Nokogiri

I'm trying to parse an XML document from the Google Directions API.
This is what I've got so far:
x = Nokogiri::XML(GoogleDirections.new("48170", "48104").xml)
x.xpath("//DirectionsResponse//route//leg//step").each do |q|
q.xpath("html_instructions").each do |h|
puts h.inner_html
end
end
The output looks like this:
Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>
Turn <b>right</b> onto <b>N Territorial Rd</b>
Turn <b>left</b> onto <b>Gotfredson Rd</b>
...
I would like the output to be:
Turn <b>right</b> onto <b>N Territorial Rd</b>
The problem seems to be Nokogiri escaping the html within the xml
I trust Google, but I think it would be also good to sanitize it further to:
Turn right onto N Territorial Rd
But I can't (using sanitize perhaps) without the raw xml. Ideas?
Because I don't have the Google Directions API installed I can't access the XML, but I have a strong suspicion the problem is the result of telling Nokogiri you're dealing with XML. As a result it's going to return you the HTML encoded like it should be in XML.
You can unescape the HTML using something like:
CGI::unescape_html('Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>')
=> "Head <b>south</b> on <b>Hidden Pond Dr</b> toward <b>Ironwood Ct</b>\n"
unescape_html is an alias to unescapeHTML:
Unescape a string that has been HTML-escaped
CGI::unescapeHTML("Usage: foo "bar" <baz>")
# => "Usage: foo \"bar\" "
I had to think about this a bit more. It's something I've run into, but it was one of those things that escaped me during the rush at work. The fix is simple: You're using the wrong method to retrieve the content. Instead of:
puts h.inner_html
Use:
puts h.text
I proved this using:
require 'httpclient'
require 'nokogiri'
# This URL comes from: https://developers.google.com/maps/documentation/directions/#XML
url = 'http://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Los+Angeles,CA&waypoints=Joplin,MO|Oklahoma+City,OK&sensor=false'
clnt = HTTPClient.new
doc = Nokogiri::XML(clnt.get_content(url))
doc.search('html_instructions').each do |html|
puts html.text
end
Which outputs:
Head <b>south</b> on <b>S Federal St</b> toward <b>W Van Buren St</b>
Turn <b>right</b> onto <b>W Congress Pkwy</b>
Continue onto <b>I-290 W</b>
[...]
The difference is that inner_html is reading the content of the node directly, without decoding. text decodes it for you. text, to_str and inner_text are aliased to content internally in Nokogiri::XML::Node for our parsing pleasure.
Wrap your nodes in CDATA:
def wrap_in_cdata(node)
# Using Nokogiri::XML::Node#content instead of #inner_html (which
# escapes HTML entities) so nested nodes will not work
node.inner_html = node.document.create_cdata(node.content)
node
end
Nokogiri::XML::Node#inner_html escapes HTML entities except in CDATA sections.
fragment = Nokogiri::HTML.fragment "<div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>"
puts fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span></div>
fragment.xpath(".//span").each {|node| node.inner_html = node.document.create_cdata(node.content) }
fragment.inner_html
# <div>Here is an unescaped string: <span>Turn left > right > straight & reach your destination.</span>\n</div>
This is not a great or DRY solution, but it works:
puts h.inner_html.gsub("<b>" , "").gsub("</b>", "").gsub("<div style=\"font-size:0.9em\">", "").gsub("</div>", "")

Converting Jsonp to Json in different methods

I been trying to use JSONP data in a json format in a ruby project.
From your experiences how did you address this?
JSONP is easy to handle. It's just JSON in a minor wrapper, and that wrapper is easy to strip off:
require 'open-uri'
require 'json'
URL = 'http://www.google.com/dictionary/json?callback=a&sl=en&tl=en&q=epitome'
jsonp = open(URL).read
jsonp now contains the result in JSONP format:
jsonp[0, 3] # => "a({"
jsonp[-11 ... -1] # => "},200,null"
Those extraneous parts, a{ and ,200,null" are the trouble spots when passing the data to JSON for parsing, so we strip them.
A simple, greedy, regex is all that's needed. /{.+}/ will find everything wrapped by the outermost curly-braces and return it, which is all the JSON needs:
data = JSON.parse(jsonp[/{.+}/])
data['query'] # => "epitome"
data['primaries'].size # => 1
From my experience, one way is to use this regex to filter out the function callback name:
/(\{.*\})/m
or the lazy way would be find the index of the first occurrence of "(" and just substring it with last character, which would be a ")" .
I been trying to look for answers on here, didn't get a solid answer, hope this helps.
Cheers

Rails String Interpolation in a string from a database

So here is my problem.
I want to retrieve a string stored in a model and at runtime change a part of it using a variable from the rails application. Here is an example:
I have a Message model, which I use to store several unique messages. So different users have the same message, but I want to be able to show their name in the middle of the message, e.g.,
"Hi #{user.name}, ...."
I tried to store exactly that in the database but it gets escaped before showing in the view or gets interpolated when storing in the database, via the rails console.
Thanks in advance.
I don't see a reason to define custom string helper functions. Ruby offers very nice formatting approaches, e.g.:
"Hello %s" % ['world']
or
"Hello %{subject}" % { subject: 'world' }
Both examples return "Hello world".
If you want
"Hi #{user.name}, ...."
in your database, use single quotes or escape the # with a backslash to keep Ruby from interpolating the #{} stuff right away:
s = 'Hi #{user.name}, ....'
s = "Hi \#{user.name}, ...."
Then, later when you want to do the interpolation you could, if you were daring or trusted yourself, use eval:
s = pull_the_string_from_the_database
msg = eval '"' + s + '"'
Note that you'll have to turn s into a double quoted string in order for the eval to work. This will work but it isn't the nicest approach and leaves you open to all sorts of strange and confusing errors; it should be okay as long as you (or other trusted people) are writing the strings.
I think you'd be better off with a simple micro-templating system, even something as simple as this:
def fill_in(template, data)
template.gsub(/\{\{(\w+)\}\}/) { data[$1.to_sym] }
end
#...
fill_in('Hi {{user_name}}, ....', :user_name => 'Pancakes')
You could use whatever delimiters you wanted of course, I went with {{...}} because I've been using Mustache.js and Handlebars.js lately. This naive implementation has issues (no in-template formatting options, no delimiter escaping, ...) but it might be enough. If your templates get more complicated then maybe String#% or ERB might work better.
one way I can think of doing this is to have templates stored for example:
"hi name"
then have a function in models that just replaces the template tags (name) with the passed arguments.
It can also be User who logged in.
Because this new function will be a part of model, you can use it like just another field of model from anywhere in rails, including the html.erb file.
Hope that helps, let me know if you need more description.
Adding another possible solution using Procs:
#String can be stored in the database
string = "->(user){ 'Hello ' + user.name}"
proc = eval(string)
proc.call(User.find(1)) #=> "Hello Bob"
gsub is very powerful in Ruby.
It takes a hash as a second argument so you can supply it with a whitelist of keys to replace like that:
template = <<~STR
Hello %{user_email}!
You have %{user_voices_count} votes!
Greetings from the system
STR
template.gsub(/%{.*?}/, {
"%{user_email}" => 'schmijos#example.com',
"%{user_voices_count}" => 5,
"%{release_distributable_total}" => 131,
"%{entitlement_value}" => 2,
})
Compared to ERB it's secure. And it doesn't complain about single % and unused or inexistent keys like string interpolation with %(sprintf) does.

Resources