HTML Entity problems using Nokogiri::XML.fragment - ruby

It seems that all entities are killed when using
tags = "<p>test umlauts &ouml;</p>"
Nokogiri::XML.fragment(tags)
Result:
<p>test umlauts </p>
The above method calls Nokogiri::XML::DocumentFragment.parse(tags), and that method calls
Nokogiri::XML::DocumentFragment.new(XML::Document.new, tags).
According to the Nokogiri documentation, this code is then executed:
def initialize document, tags=nil
  if tags
    parser = if self.kind_of?(Nokogiri::HTML::DocumentFragment)
      HTML::SAX::Parser.new(FragmentHandler.new(self, tags))
    else
      XML::SAX::Parser.new(FragmentHandler.new(self, tags))
    end
    parser.parse(tags)
  end
end
I think we are dealing with the XML::SAX::Parser and the corresponding FragmentHandler. Digging around the code gives no hint; which parameters do I have to set to get the correct result?

&ouml; is not a predefined entity in XML. If you want to allow HTML entity references in XHTML, you'd need to use a parser that reads the external DTD named in the doctype. This is a lot of effort; you may prefer to just use the HTML parser if you have HTML-compatible XHTML with entity references.

Related

Loading non valid xml in Calabash / xproc

I am trying to create a routine for validating XHTML documents. I use XProc, which I run in Calabash. The input is an XHTML document, and this document may not be valid.
For testing I edit an xhtml document. I simply delete a and introduce an error. This is the error I hope to detect when I validate the document.
In order to validate I use p:validate-with-xml-schema, supply a schema, and write the result of the validation to a new file.
But if the input file is not valid in the first place xproc/Calabash stops. The error message is basically an error message from Saxon pointing out that the is missing. But I wanted the validation output in my output file. How do I do that?
<p:input port="source" primary="true"/>
<p:load name="xml-doc" href="'input.xhtml'"/>
<p:validate-with-xml-schema name="validate">
  <p:input port="source">
    <p:pipe port="result" step="xml-doc"/>
  </p:input>
  <p:with-option name="assert-valid" select="'false'"/>
  <p:with-option name="mode" select="'lax'"/>
  <p:input port="schema">
    <p:document href="xhtml-schema.xsd"/>
  </p:input>
</p:validate-with-xml-schema>
<p:store name="valid-store">
  <p:input port="source">
    <p:pipe port="result" step="validate"/>
  </p:input>
  <p:with-option name="href" select="'output.xml'"/>
</p:store>
From your question it's not clear what exactly Saxon is complaining about, but I assume that it cannot find the input XHTML file. You wrap the file name in single quotes in p:load/@href, which is not correct. When you use the attribute-based shortcut form for options, the value of the attribute is taken as-is and is not interpreted as an XPath expression (unlike the long p:with-option form, whose select value is evaluated as XPath). The attribute should therefore read href="input.xhtml", without the inner quotes.

Extracting value from complex hash in Ruby

I am using an API (zillow) which returns a complex hash. A sample result is
{"xmlns:xsi"=>"http://www.w3.org/2001/XMLSchema-instance",
"xsi:schemaLocation"=>"http://www.zillow.com/static/xsd/SearchResults.xsd http://www.zillowstatic.com/vstatic/5985ee4/static/xsd/SearchResults.xsd",
"xmlns:SearchResults"=>"http://www.zillow.com/static/xsd/SearchResults.xsd", "request"=>[{"address"=>["305 Vinton St"], "citystatezip"=>["Melrose, MA 02176"]}],
"message"=>[{"text"=>["Request successfully processed"], "code"=>["0"]}],
"response"=>[{"results"=>[{"result"=>[{"zpid"=>["56291382"], "links"=>[{"homedetails"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/"],
"graphsanddata"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/#charts-and-data"], "mapthishome"=>["http://www.zillow.com/homes/56291382_zpid/"],
"comparables"=>["http://www.zillow.com/homes/comps/56291382_zpid/"]}], "address"=>[{"street"=>["305 Vinton St"], "zipcode"=>["02176"], "city"=>["Melrose"], "state"=>["MA"], "latitude"=>["42.466805"],
"longitude"=>["-71.072515"]}], "zestimate"=>[{"amount"=>[{"currency"=>"USD", "content"=>"562170"}], "last-updated"=>["06/01/2014"], "oneWeekChange"=>[{"deprecated"=>"true"}], "valueChange"=>[{"duration"=>"30", "currency"=>"USD", "content"=>"42749"}], "valuationRange"=>[{"low"=>[{"currency"=>"USD",
"content"=>"534062"}], "high"=>[{"currency"=>"USD", "content"=>"590278"}]}], "percentile"=>["0"]}], "localRealEstate"=>[{"region"=>[{"id"=>"23017", "type"=>"city",
"name"=>"Melrose", "links"=>[{"overview"=>["http://www.zillow.com/local-info/MA-Melrose/r_23017/"], "forSaleByOwner"=>["http://www.zillow.com/melrose-ma/fsbo/"],
"forSale"=>["http://www.zillow.com/melrose-ma/"]}]}]}]}]}]}]}
I can extract a specific value using the following:
result = result.to_hash
p result["response"][0]["results"][0]["result"][0]["zestimate"][0]["amount"][0]["content"]
It seems odd to have to specify the index of each element in this fashion. Is there a simpler way to obtain a named value?
It looks like this should be parsed as XML. According to the Zillow API docs, it returns XML by default. Apparently, "to_hash" was able to turn this into a hash (albeit a very ugly one), but you are really trying to swim upstream by using it this way. I would recommend using it as intended (XML) at the start, and then maybe parsing it into an easier-to-use format (like a JSON/Hash structure) later.
Nokogiri is great at parsing XML. You can use XPath syntax for grabbing elements from the DOM, or even CSS selectors.
For example, to get an array of the "content" in every result:
response = #get xml response from zillow
results = Nokogiri::XML(response).remove_namespaces!
#using css
content_array = results.css("result content")
#same thing using xpath:
content_array = results.xpath("//result//content")
If you just want the content from the first result, you can do this as a shortcut:
content = results.at_css("result content").content
Since it is indeed XML dumped into a JSON-like structure, you could use JSONPath to query it.
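If you do keep the hash, plain Ruby can at least shorten the lookup. The sketch below uses Hash#dig (available from Ruby 2.3) and a hypothetical helper, dig_first, that steps into the first element of every array it meets. SAMPLE_HASH is a trimmed copy of the hash from the question, keeping only the path being read.

```ruby
# Trimmed copy of the sample hash from the question.
SAMPLE_HASH = {
  "response" => [{
    "results" => [{
      "result" => [{
        "zpid" => ["56291382"],
        "zestimate" => [{
          "amount" => [{ "currency" => "USD", "content" => "562170" }]
        }]
      }]
    }]
  }]
}

# Hypothetical helper: before applying each key, step into the first
# element of any array, so the [0] indexes need not be spelled out.
# The final value is returned as-is (it may itself be an array).
def dig_first(data, *keys)
  keys.reduce(data) do |node, key|
    node = node.first while node.is_a?(Array)
    return nil unless node.is_a?(Hash)
    node[key]
  end
end

# Built-in alternative: dig accepts the keys and indexes in one call.
p SAMPLE_HASH.dig("response", 0, "results", 0, "result", 0,
                  "zestimate", 0, "amount", 0, "content")

p dig_first(SAMPLE_HASH, "response", "results", "result",
            "zestimate", "amount", "content")
```

Both calls read the same value; dig_first silently takes the first result, so it only suits responses where a single match is expected.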

XML Namespace issue with Nokogiri

I have the following XML:
<body>
<hello xmlns='http://...'>
<world>yes</world>
</hello>
</body>
When I load that into a Nokogiri XML document, and call document.at_css "world", I receive nil back. But when I remove the namespace for hello, it works perfectly. I know I can call document.remove_namespaces!, but why is it that it will not work with the namespace?
Because Nokogiri requires you to register the XML namespaces you are querying within (read more about XML Namespaces). But you should still be able to query the element if you specify its namespace when calling at_css. To see the exact usage, check out the css method documentation. It should end up looking something like this:
document.at_css "ns|world", 'ns' => 'namespace URI'

Building a class name using Data in a haml/ruby project

I currently have some haml code which reads as
%span.flagb.flag-gb
This builds me a nice span with the classes:
flagB
flag-gb
(which puts a nice sprite of the GB (Great Britain) flag on the page).
Now I don't want to hard-code the gb. I have the ISO country code, which I can access with
=code
but I am so new that I don't know the best way of replacing the "gb" with the code value.
Full code below, as I have it at the moment:
- TZInfo::Country.all_codes.each do |code|
  %li
    %a(href='#')
      %span.flagb.flag-gb
      =code
The only way I have managed it so far is using pure HTML:
<span class='flagB flag-#{code}'></span>
Thanks
The .classname syntax is just a shorthand; you can do it the long way:
%span{:class => "flagb flag-#{code}"}
See the HAML reference on class and id attributes for more information.
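One detail worth checking, sketched here as an assumption rather than a confirmed fact about your data: if the codes arrive in upper case (for example "GB") while the sprite classes are lower case (flag-gb), the interpolated class will not match unless it is downcased first. The helper name flag_classes is made up for this illustration.

```ruby
# Hypothetical helper: builds the class attribute value for a country
# code, downcasing it so that "GB" yields "flag-gb".
def flag_classes(code)
  "flagb flag-#{code.to_s.downcase}"
end

%w[GB DE us].each { |code| puts flag_classes(code) }
```

In the template this could then be used as %span{:class => flag_classes(code)}, assuming the helper is available to the view.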

How to parse html source code with ruby/nokogiri?

I've successfully used ruby (1.8) and nokogiri's css parsing to pull out front facing data from web pages.
However I now need to pull out some data from a series of pages where the data is in the "meta" tags in the source code of the page.
One of the lines I need is the following:
<meta name="geo.position" content="35.667459;139.706256" />
I've tried using XPath but haven't been able to get it right.
Any help as to what syntax is needed would be much appreciated.
Thanks
This is a good case for a CSS attribute selector. For example:
doc.css('meta[name="geo.position"]').each do |meta_tag|
  puts meta_tag['content'] # => 35.667459;139.706256
end
The equivalent XPath expression is almost identical:
doc.xpath('//meta[@name = "geo.position"]').each do |meta_tag|
  puts meta_tag['content'] # => 35.667459;139.706256
end
require 'nokogiri'
doc = Nokogiri::HTML('<meta name="geo.position" content="35.667459;139.706256" />')
doc.at('//meta[@name="geo.position"]')['content'] # => "35.667459;139.706256"

Resources