I'm wondering if there's a function in Ruby like is_xml?(string) to identify if a given string is XML formatted.
Nokogiri's parse uses a simple regex test looking for <html> in an attempt to determine if the data to be parsed is HTML or XML:
string =~ /^\s*<[^Hh>]*html/ # Probably html
Something similar, looking for the XML declaration would be a starting point:
string = '<?xml version="1.0"?><foo><bar></bar></foo>'
string.strip[/\A<\?xml/]
=> "<?xml"
If that returns anything other than nil the string contains the XML declaration. It's important to test for this because an empty string will fool the next steps.
Nokogiri::XML('').errors.empty?
=> true
Nokogiri also has the errors method, which will return an array of errors after attempting to parse a document that is malformed. Testing that for any size would help:
Nokogiri::XML('<foo>').errors
=> [#<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>]
Nokogiri::XML('<foo>').errors.empty?
=> false
Nokogiri::XML(string).errors.empty?
=> true
would be true if the document is syntactically valid.
I just tested Nokogiri to see if it could tell the difference between a regular string vs. true XML:
[2] (pry) main: 0> doc = Nokogiri::XML('foo').errors
[
[0] #<Nokogiri::XML::SyntaxError: Start tag expected, '<' not found>
]
So, you can loop through your files and sort them into XML and non-XML easily:
require 'nokogiri'
[
'',
'foo',
'<xml></xml>'
].group_by{ |s| (s.strip > '') && Nokogiri::XML(s).errors.empty? }
=> {false=>["", "foo"], true=>["<xml></xml>"]}
Assign the result of group_by to a variable, and you'll have a hash you can check for non-XML (false) or XML (true).
There is no such function in Ruby's String class or Active Support's String extensions, but you can use Nokogiri to detect errors in XML:
begin
  bad_doc = Nokogiri::XML(badly_formed) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  puts "caught exception: #{e}"
end
The usual way to safely load a typical single-document YAML file is YAML.safe_load(content).
YAML files can contain multiple documents:
---
key: value
---
key: !ruby/struct
  foo: bar
Loading a YAML file such as this using YAML.safe_load(content) will only return the first document:
{ 'key' => 'value' }
If you split the file and try to safe_load the second document, you will get an exception as expected:
Psych::DisallowedClass (Tried to load unspecified class: Struct)
To load multiple documents you can use YAML.load_stream(content) which returns an array:
[
{ 'key' => 'value' },
{ 'key' => #<struct foo="bar"> }
]
The problem is that there is no YAML.safe_load_stream that would raise exceptions for non-whitelisted data types.
I wrote a workaround that utilizes the YAML.parse_stream interface:
Edit: This is now available as the gem yaml-safe_load_stream. Also, the maintainers of Psych (the YAML library in Ruby's stdlib) are looking into adding this feature to the library.
require 'yaml'
module YAML
  def safe_load_stream(yaml, filename = nil, &block)
    parse_stream(yaml, filename) do |stream|
      raise_if_tags(stream, filename)
      if block_given?
        yield stream.to_ruby
      else
        stream.to_ruby
      end
    end
  end
  module_function :safe_load_stream

  def raise_if_tags(obj, filename = nil, doc_num = 1)
    doc_num += 1 if obj.is_a?(Psych::Nodes::Document)
    if obj.respond_to?(:tag)
      if tag = obj.tag
        message = "tag #{tag} encountered on line #{obj.start_line} column #{obj.start_column} of document #{doc_num}"
        message << " in file #{filename}" if filename
        raise Psych::DisallowedClass, message
      end
    end
    if obj.respond_to?(:children)
      Array(obj.children).each do |child|
        raise_if_tags(child, filename, doc_num)
      end
    end
  end
  module_function :raise_if_tags
  private_class_method :raise_if_tags
end
With this you can do:
YAML.safe_load_stream(content, 'file.txt')
And get an exception:
Psych::DisallowedClass (Tried to load unspecified class: tag !ruby/struct
encountered on line 1 column 7 of document 2 in file file.txt)
The line numbers returned from .start_line are relative to the document start. I didn't find a way to get the line number where the document starts, so I added the document number to the error message.
Unlike YAML.safe_load, it does not support class and symbol whitelists or toggling of anchors/aliases.
Also, there are probably ways to use tags that will give a false positive with such simplistic unless tag.nil? detection.
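If you only want to know which documents carry explicit tags, the same tree walk can be written without a block. This is a rough sketch rather than the gem's implementation: explicit_tags is a hypothetical helper name, and it relies on YAML.parse_stream returning the whole stream node when called without a block, so each document can be numbered by its position among the stream's children.

```ruby
require 'yaml'

# Hypothetical helper: returns [document_number, tag] pairs for every
# explicitly tagged node in a (possibly multi-document) YAML string.
def explicit_tags(yaml)
  stream = YAML.parse_stream(yaml) # no block: returns the stream node
  stream.children.each_with_index.flat_map do |doc, index|
    # Psych nodes are Enumerable and walk their subtree depth-first.
    doc.select { |node| node.respond_to?(:tag) && node.tag }
       .map { |node| [index + 1, node.tag] }
  end
end

content = <<~DOCS
  ---
  key: value
  ---
  key: !ruby/struct
    foo: bar
DOCS

explicit_tags(content) # => [[2, "!ruby/struct"]]
```

An empty result means no node in any document carries an explicit tag. The same caveat applies here: any explicit tag is reported, including harmless ones.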
I'm using the engtagger gem to classify a sentence according to its parts of speech. The output I get is as follows:
puts text
# => "<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>"
I would have expected the gem to give me an array, but I guess I'll have to coerce this into an array myself.
What I'm eventually trying to get is a nested array something like this:
[["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]
However I'm not really sure how to approach this with Nokogiri (or another parser library). Here's what I've tried:
(byebug) doc = Nokogiri::XML(text)
#<Nokogiri::XML::Document:0x3fd400286e78 name="document" children=[#<Nokogiri::XML::Element:0x3fd400286900 name="nnp" children=[#<Nokogiri::XML::Text:0x3fd400286464 "My">]>]>
(byebug) Nokogiri.parse(text)
#<Nokogiri::XML::Document:0x3fd40028cd50 name="document" children=[#<Nokogiri::XML::Element:0x3fd40028c7d8 name="nnp" children=[#<Nokogiri::XML::Text:0x3fd40028c378 "My">]>]>
So I've tried two different Nokogiri methods, but both are only showing the first node. How can I get the rest of the adjacent nodes as well?
Alternatively, how can I get the engtagger call to return an array? In the docs, I didn't find an example of how to return an array with all tags, only arrays with one specific kind of tag.
The main thing is that well-formed XML must have a single root node. You were receiving only the very first node because it was treated as the root (that is, the topmost) node, and once it was closed, Nokogiri considered the XML document to have ended.
Nokogiri::XML("<root>#{text}</root>").
children.first. # get root node
children.map { |e| [e.text, e.name] }. # map to what’s needed
reject { |e| e.last == 'text' } # filter out garbage
That filtering might be more semantically correct:
Nokogiri::XML("<root>#{text}</root>").
children.first.
children.reject { |e| Nokogiri::XML::Text === e }.
map { |e| [e.text, e.name] }
The problem is you're parsing the fragment incorrectly:
require 'nokogiri'
doc = Nokogiri::XML.fragment("<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>")
doc.to_xml # => "<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>"
Nokogiri wants valid XML, but you can get it to accept partial XML chunks using fragment.
At that point you're able to do:
doc.children.each_with_object([]){ |n, a| a << [n.text, n.name] unless n.text? }
# => [["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]
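For markup this flat you can also skip the parser entirely. This is a different technique (a plain regular expression, not Nokogiri), sketched under the assumption that the tagger output never nests tags and never uses attributes; the variable name pairs is just for illustration.

```ruby
text = '<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>'

# Capture the tag name, then the text up to the matching closing tag.
# \1 makes the closing tag match the same name as the opening tag.
pairs = text.scan(%r{<(\w+)>([^<]*)</\1>}).map { |tag, word| [word, tag] }

pairs # => [["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]
```

If the output could ever contain nested tags or escaped characters, stick with the fragment approach.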
Is there a simple method/way to check if a Nokogiri XML file has a proper root, like xml.valid? A way to check if the XML file contains specific content is very welcome as well.
I'm thinking of something like xml.valid? or xml.has_valid_root?. Thanks!
How are you going to determine what is a proper root?
<foo></foo>
has a proper root:
require 'nokogiri'
xml = '<foo></foo>'
doc = Nokogiri::XML(xml)
doc.root # => #<Nokogiri::XML::Element:0x3fd3a9471b7c name="foo">
Nokogiri has no way of determining that something else should have been the root. You might be able to test if you have foreknowledge of what the root node's name should be:
doc_root_ok = (doc.root.name == 'foo')
doc_root_ok # => true
You can see if the document parsed was well-formed (not needing any fixup), by looking at errors:
doc.errors # => []
If Nokogiri had to fix up the document in order to parse it, errors will return a list of the problems it encountered while parsing:
xml = '<foo><bar><bar></foo>'
doc = Nokogiri::XML(xml)
doc.errors # => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: bar line 1 and foo>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag bar line 1>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>]
A common and useful pattern is
doc = Nokogiri::XML(xml) do |config|
  config.strict
end
This will throw a wobbly (raise a Nokogiri::XML::SyntaxError) if the document is not well formed. I like to do this to prevent Nokogiri from being too kind to my XML.
I'd like to check if a string is valid YAML. I'd like to do this from within my Ruby code with a gem or library. I only have this begin/rescue clause, but it doesn't get rescued properly:
def valid_yaml_string?(config_text)
  require 'open-uri'
  file = open("https://github.com/TheNotary/the_notarys_linux_mint_postinstall_configuration")
  hard_failing_bad_yaml = file.read
  config_text = hard_failing_bad_yaml
  begin
    YAML.load config_text
    return true
  rescue
    return false
  end
end
I am unfortunately getting the terrible error of:
irb(main):089:0> valid_yaml_string?("b")
Psych::SyntaxError: (<unknown>): mapping values are not allowed in this context at line 6 column 19
from /home/kentos/.rvm/rubies/ruby-1.9.3-p374/lib/ruby/1.9.1/psych.rb:203:in `parse'
from /home/kentos/.rvm/rubies/ruby-1.9.3-p374/lib/ruby/1.9.1/psych.rb:203:in `parse_stream'
from /home/kentos/.rvm/rubies/ruby-1.9.3-p374/lib/ruby/1.9.1/psych.rb:151:in `parse'
from /home/kentos/.rvm/rubies/ruby-1.9.3-p374/lib/ruby/1.9.1/psych.rb:127:in `load'
from (irb):83:in `valid_yaml_string?'
from (irb):89
from /home/kentos/.rvm/rubies/ruby-1.9.3-p374/bin/irb:12:in `<main>'
Using a cleaned-up version of your code:
require 'yaml'
require 'open-uri'
URL = "https://github.com/TheNotary/the_notarys_linux_mint_postinstall_configuration"
def valid_yaml_string?(yaml)
  !!YAML.load(yaml)
rescue Exception => e
  STDERR.puts e.message
  return false
end
puts valid_yaml_string?(open(URL).read)
I get:
(<unknown>): mapping values are not allowed in this context at line 6 column 19
false
when I run it.
The reason is, the data you are getting from that URL isn't YAML at all, it's HTML:
open('https://github.com/TheNotary/the_notarys_linux_mint_postinstall_configuration').read[0, 100]
=> " \n\n\n<!DOCTYPE html>\n<html>\n <head prefix=\"og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# githubog:"
If you only want a true/false response whether it's parsable YAML, remove this line:
STDERR.puts e.message
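Here is a narrower sketch of the same check, under two assumptions. First, it assumes a Ruby where Psych::SyntaxError descends from StandardError; as far as I can tell, on Ruby 1.9 it descended from ::SyntaxError, which a bare rescue does not catch, and that would explain why the code in the question was not rescued. Second, it uses YAML.parse, which only builds the syntax tree, so valid documents that load as false or nil still count as valid (with !!YAML.load they would not).

```ruby
require 'yaml'

# Syntax-only check: true when the text parses as YAML,
# regardless of what value it would load as.
def valid_yaml_string?(yaml)
  YAML.parse(yaml) # raises Psych::SyntaxError on malformed input
  true
rescue Psych::SyntaxError
  false
end

valid_yaml_string?("key: value") # => true
valid_yaml_string?("false")      # => true
valid_yaml_string?("a: b: c")    # => false
```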
Unfortunately, going beyond that and determining if the string is a YAML string gets harder. You can do some sniffing, looking for some hints:
yaml[/^---/m]
will search for the YAML "document" marker, but a YAML file doesn't have to use those, nor do they have to be at the start of the file. We can add that in to tighten up the test:
!!YAML.load(yaml) && !!yaml[/^---/m]
But even that leaves some holes, so adding a test to see what the parser returns can help even more. YAML could return a Fixnum (an Integer on Ruby 2.4 and later), a String, an Array or a Hash, but if you already know what to expect, you can check what YAML actually returns. For instance:
YAML.load(({}).to_yaml).class
=> Hash
YAML.load(({}).to_yaml).instance_of?(Hash)
=> true
So, you could look for a Hash:
parsed_yaml = YAML.load(yaml)
!!yaml[/^---/m] && parsed_yaml.instance_of?(Hash)
Replace Hash with whatever type you think you should get.
There might be even better ways to sniff it out, but those are what I'd try first.
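The type test can be wrapped up like this. It is a sketch only: yaml_of_type? is a hypothetical helper name, it uses YAML.safe_load (so it needs a Ruby whose Psych provides safe_load), and it deliberately drops the --- sniffing, since that marker is optional.

```ruby
require 'yaml'

# Hypothetical helper: true only when the text parses as YAML and
# loads as an instance of the expected type.
def yaml_of_type?(yaml, type)
  YAML.safe_load(yaml).is_a?(type)
rescue Psych::SyntaxError, Psych::DisallowedClass, Psych::BadAlias
  false
end

yaml_of_type?("a: 1", Hash)      # => true
yaml_of_type?("- 1\n- 2", Array) # => true
yaml_of_type?("a: 1", Array)     # => false
yaml_of_type?("a: b: c", Hash)   # => false
```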
Why can Ruby's built-in JSON not deserialize simple JSON primitives, and how do I work around it?
irb(main):001:0> require 'json'
#=> true
irb(main):002:0> objects = [ {}, [], 42, "", true, nil ]
#=> [{}, [], 42, "", true, nil]
irb(main):012:0> objects.each do |o|
irb(main):013:1* json = o.to_json
irb(main):014:1> begin
irb(main):015:2* p JSON.parse(json)
irb(main):016:2> rescue Exception => e
irb(main):017:2> puts "Error parsing #{json.inspect}: #{e}"
irb(main):018:2> end
irb(main):019:1> end
{}
[]
Error parsing "42": 706: unexpected token at '42'
Error parsing "\"\"": 706: unexpected token at '""'
Error parsing "true": 706: unexpected token at 'true'
Error parsing "null": 706: unexpected token at 'null'
#=> [{}, [], 42, "", true, nil]
irb(main):020:0> RUBY_DESCRIPTION
#=> "ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.0]"
irb(main):022:0> JSON::VERSION
#=> "1.4.2"
RFC 4627: The application/json Media Type for JavaScript Object Notation (JSON) has this to say:
2. JSON Grammar
A JSON text is a sequence of tokens. The set of tokens includes six
structural characters, strings, numbers, and three literal names.
A JSON text is a serialized object or array.
JSON-text = object / array
[...]
2.1. Values
A JSON value MUST be an object, array, number, or string, or one of
the following three literal names:
false null true
If you call to_json on your six sample objects, you get this:
>> objects = [ {}, [], 42, "", true, nil ]
>> objects.map { |o| puts o.to_json }
{}
[]
42
""
true
null
So the first and second are valid JSON texts whereas the last four are not valid JSON texts even though they are valid JSON values.
JSON.parse wants what it calls a JSON document:
Parse the JSON document source into a Ruby data structure and return it.
Perhaps JSON document is the library's term for what RFC 4627 calls a JSON text. If so, then raising an exception is a reasonable response to an invalid input.
You can work around this by forcibly wrapping and unwrapping everything:
objects.each do |o|
  json = o.to_json
  begin
    json_text = '[' + json + ']'
    p JSON.parse(json_text)[0]
  rescue Exception => e
    puts "Error parsing #{json.inspect}: #{e}"
  end
end
And as you note in your comment, using an array as the wrapper is better than an object in case the caller wants to use the :symbolize_names option. Wrapping like this means that you'll always be feeding JSON.parse a JSON text and everything should be fine.
This is quite an old question, but I think it is worth having a proper answer to prevent hair loss for those who have just run into the problem and are still searching for a solution :)
To parse "JSON primitives" with a JSON gem version below 2, you can pass the quirks_mode: true option like so:
JSON::VERSION # => 1.8.6
json_text = "This is a json primitive".to_json
JSON.parse(json_text, quirks_mode: true)
With JSON gem version 2 or later, quirks_mode is no longer necessary.
JSON::VERSION # => 2.0.0
json_text = "This is a json primitive".to_json
JSON.parse(json_text)
Before parsing the JSON, you can check the version of the JSON gem that you are using in your project with bundle show json or gem list | grep json and then use the corresponding one.
Happy JSON parsing!
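If the same code has to run against both old and new versions of the gem, the version check can live in code instead. This is a minimal sketch; parse_json_value is a hypothetical helper name, and it assumes RubyGems is loaded (the default), so that Gem::Version is available for the comparison.

```ruby
require 'json'

# Hypothetical helper: parses any JSON value, passing quirks_mode
# only to JSON gem versions below 2, which need it for primitives.
def parse_json_value(text)
  if Gem::Version.new(JSON::VERSION) >= Gem::Version.new('2.0.0')
    JSON.parse(text)
  else
    JSON.parse(text, quirks_mode: true)
  end
end

parse_json_value('42')
parse_json_value('"This is a json primitive"')
parse_json_value('null')
```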
It appears that the built-in JSON parser intentionally fails on anything but objects and arrays. My current workaround is the following:
# Work around a flaw in Ruby's built-in JSON parser
# not accepting anything but an object or array at the root level.
module JSON
  def self.parse_any(str,opts={})
    parse("[#{str}]",opts).first
  end
end
Use JSON.load instead of JSON.parse to handle primitives:
e.g.
JSON.load('true') # => true
JSON.load('false') # => false
JSON.load('5150') # => 5150
JSON.load('null') # => nil
I think you are right...whether it is a bug or not, there is some wonky logic going on with the implementation. If it can parse arrays, and hashes it should be able to parse everything else.
Because JSON.parse seems geared for objects and arrays, I would try to pass your data one of those ways if you can, and if you can't, stick with the workaround you have.