Nokogiri Builder will omit xmlns attribute if already used - ruby

I'm trying to build an XML spreadsheet with styles that can be opened in Excel.
This is my code:
res = Nokogiri::XML::Builder.new(encoding: 'UTF-8') do |xml|
xml.Workbook 'xmlns' => "urn:schemas-microsoft-com:office:spreadsheet",
'xmlns:o' => "urn:schemas-microsoft-com:office:office",
'xmlns:x' => "urn:schemas-microsoft-com:office:excel",
'xmlns:html' => "http://www.w3.org/TR/REC-html40",
'xmlns:ss' => "urn:schemas-microsoft-com:office:spreadsheet" do
xml.WorksheetOptions "xmlns" => "urn:schemas-microsoft-com:office:excel" do
xml.PageSetup do
xml.Layout "x:Orientation" => "Landscape"
xml.Header "x:Data" => "&LLeft side&CCenter&R&D &T"
xml.Footer "x:Data" => "&CPage: &P / &N"
end
xml.Unsynced
xml.FitToPage
xml.Print do
xml.FitHeight 20
xml.ValidPrinterInfo
xml.Scale 90
xml.HorizontalResolution -4
xml.VerticalResolution -4
end
xml.Zoom 125
xml.PageLayoutZoom 0
xml.Selected
xml.Panes do
xml.Pane do
xml.Number 3
xml.ActiveRow 8
xml.ActiveCol 4
end
end
xml.ProtectObjects "False"
xml.ProtectScenarios "False"
xml.AllowFormatCells
xml.AllowSizeCols
xml.AllowSizeRows
xml.AllowSort
xml.AllowFilter
xml.AllowUsePivotTables
end
end
end.to_xml
puts res
I had this as a working template for years (I was using bundler's builder before, which is now just too slow), and now that I have switched to Nokogiri it no longer works. Basically, the "xmlns" => "urn:schemas-microsoft-com:office:excel" in the WorksheetOptions tag gets ignored and is not added to the document. Here is the actual result:
<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<WorksheetOptions>
<PageSetup>
<Layout x:Orientation="Landscape"/>
<Header x:Data="&LLeft side&CCenter&R&D &T"/>
<Footer x:Data="&CPage: &P / &N"/>
</PageSetup>
<Unsynced/>
<FitToPage/>
<Print>
<FitHeight>20</FitHeight>
<ValidPrinterInfo/>
<Scale>90</Scale>
<HorizontalResolution>-4</HorizontalResolution>
<VerticalResolution>-4</VerticalResolution>
</Print>
<Zoom>125</Zoom>
<PageLayoutZoom>0</PageLayoutZoom>
<Selected/>
<Panes>
<Pane>
<Number>3</Number>
<ActiveRow>8</ActiveRow>
<ActiveCol>4</ActiveCol>
</Pane>
</Panes>
<ProtectObjects>False</ProtectObjects>
<ProtectScenarios>False</ProtectScenarios>
<AllowFormatCells/>
<AllowSizeCols/>
<AllowSizeRows/>
<AllowSort/>
<AllowFilter/>
<AllowUsePivotTables/>
</WorksheetOptions>
</Workbook>
If I write anything else as the xmlns attribute on this line xml.WorksheetOptions "xmlns" => "urn:schemas-microsoft-com:office:excel" do it will work and get added correctly to the document.
This is wrong: apparently Excel will not set up the page properly if that attribute is missing. Is this correct behavior for Nokogiri?
If it is, is there any other way to make Excel apply the correct page layout to the document?
This happens with another tag that I didn't include in the example, otherwise it would have been too long. This is the other one: xml.DocumentProperties("xmlns" => "urn:schemas-microsoft-com:office:office") do.

I'm not sure about the builder interface, but you can always add it directly to a node by using add_namespace with a nil namespace:
node.add_namespace(nil, "urn:schemas-microsoft-com:office:excel")
See the documentation on Node#add_namespace for details.

Related

How to add a comment before XML root node, in Nokogiri?

This is what I'm doing:
xml = Nokogiri::XML('<hello/>')
xml.root.add_previous_sibling(
Nokogiri::XML::Comment.new(
xml, '<!-- how are you? -->'
)
)
This is what I'm trying to achieve:
<?xml version="1.0"?>
<!-- how are you? -->
<hello/>
I'm getting:
ArgumentError: A document may not have multiple root nodes.
What is the right way?
The comment should be added through the xml.children NodeSet.
Here is an example:
xml = Nokogiri::XML('<hello/>')
=> #<Nokogiri::XML::Document:0x3fe1db8d0ed0 name="document" children=[#<Nokogiri::XML::Element:0x3fe1db8d0584 name="hello">]>
xml.children.before(Nokogiri::XML::Comment.new(xml, 'how are you?'))
=> #<Nokogiri::XML::Element:0x3fe1db8d0584 name="hello">
xml.to_s
=> "<?xml version=\"1.0\"?>\n<!--how are you?-->\n<hello/>\n"

How to filter XML elements by date range in Ruby

I typically use Nokogiri as my XML parser.
I have the following XML:
<albums>
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<classix_nouveaux album="Night People"/>
<release_date value="19820501"/>
</classix_nouveaux>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
</albums>
I want to get all albums that were released between 1/1/1980 and 4/15/1982:
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
How do I filter/query the XML by a release_date range?
Your XML is malformed. After parsing, here's what Nokogiri has to say about it:
doc.errors
# => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: albums line 1 and classix_nouveaux>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>]
That's because:
<classix_nouveaux album="Night People"/>
and
<engligh_beat album="I Just Can't Stop It"/>
are terminated. Instead they should be:
<classix_nouveaux album="Night People">
and
<engligh_beat album="I Just Can't Stop It">
You can use CSS or XPath selectors to find exact matches, or even substring matches, but neither CSS nor XPath understands "ranges" of dates, nor do they have any notion of what a Date is. You have to extract all the nodes, convert each date value into a Date object (or an integer, in this case), then compare it to the range:
date_range = 19800101..19820415 # 1/1/1980 through 4/15/1982 as YYYYMMDD integers
selected_albums = doc.search('//release_date').select { |rd| date_range.include?(rd['value'].to_i) }.map { |rd| rd.parent }
selected_albums.map(&:to_xml)
# => ["<aldo_nova album=\"aldo nova\">\n" +
# " <release_date value=\"19820401\"/>\n" +
# "</aldo_nova>",
# "<engligh_beat album=\"I Just Can't Stop It\">\n" +
# " <release_date value=\"19800501\"/>\n" +
# "</engligh_beat>"]
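If real Date objects are preferred over integers, the comparison can be sketched with the standard library alone. This is a minimal sketch, assuming the value attributes have already been extracted into plain strings; the release_dates hash and its keys are illustrative stand-ins for the parsed nodes:

```ruby
require 'date'

# Illustrative stand-in for the extracted release_date values (YYYYMMDD strings).
release_dates = {
  'aldo_nova'        => '19820401',
  'classix_nouveaux' => '19820501',
  'engligh_beat'     => '19800501'
}

# The range requested in the question: 1/1/1980 through 4/15/1982.
date_range = Date.new(1980, 1, 1)..Date.new(1982, 4, 15)

# Keep only the entries whose parsed date falls inside the range.
selected = release_dates.select do |_name, value|
  date_range.cover?(Date.strptime(value, '%Y%m%d'))
end

p selected.keys
```

Parsing into Date objects costs a little more than to_i, but it rejects impossible values such as "19821340" instead of silently comparing them.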
I think your XML is poorly designed because you have varying tag names for what should be an album. <album> should be a child of <albums>. I'd recommend something like this:
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
Once the XML is in a standard form, then it becomes easier to navigate and search:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
EOT
doc.search('album').last['title'] # => "I Just Can't Stop It"
band = 'aldo nova'
doc.search("//album[@band='#{band}']").map { |a| a['title'] } # => ["aldo nova"]
and searching for dates becomes more straightforward because it's not necessary to find the parent of the node:
date_range = 19800101..19820415 # 1/1/1980 through 4/15/1982 as YYYYMMDD integers
selected_albums = doc.search('album').select { |a| date_range.include?(a['release_date'].to_i) }
selected_albums.map(&:to_xml)
# => ["<album band=\"aldo nova\" title=\"aldo nova\" release_date=\"19820401\"/>",
# "<album band=\"english beat\" title=\"I Just Can't Stop It\" release_date=\"19800501\"/>"]
I'd recommend reading some tutorials on XML itself as it's easy to paint ourselves into corners if the data isn't represented logically and correctly.

How to pull data from tags based on other tags

I have the following example document:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>Doe</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>DOE</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>222222222</irs:SSN>
</EmployeeInfoGrp>
</Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>
Using Nokogiri, I want to extract the values of the <PersonFirstNm>, <PersonLastNm> and <irs:SSN> elements for each <Form1095CUpstreamDetail>, based on its <RecordId>.
I tried removing namespaces as well. I posted a small snippet, but I have tried many iterations of working through the XML with no success. This is my first time using XML, so I realize I am likely missing something easy.
When I set my XPath:
require 'nokogiri'
submission_doc = Nokogiri::XML(open('1094C_Request.xml'))
submission_doc.remove_namespaces!
nodes = submission_doc.xpath('//Form1095CUpstreamDetail')
I do not seem to have any association between the RecordId and the tags mentioned above, and I am stuck on where to go next.
The fields are not listed as children for the RecordId, so I can't think of how to approach obtaining their values. I am including the full document as an example to make sure I am not excluding anything.
I have an array of values, and I would like to pull the three tags mentioned above if the RecordId is contained within the array of numbers.
Nokogiri makes it pretty easy to do what you want (assuming the XML is syntactically correct). I'd do something like:
require 'nokogiri'
require 'pp'
doc = Nokogiri::XML(<<EOT)
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonLastNm>Doe</PersonLastNm>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonLastNm>DOE</PersonLastNm>
<irs:SSN>222222222</irs:SSN>
</Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>
EOT
info = doc.search('Form1095CUpstreamDetail').map{ |form|
{
record_id: form.at('RecordId').text,
person_first_nm: form.at('PersonFirstNm').text,
person_last_nm: form.at('PersonLastNm').text,
ssn: form.at('irs|SSN').text
}
}
pp info
# >> [{:record_id=>"1",
# >> :person_first_nm=>"JOHN",
# >> :person_last_nm=>"Doe",
# >> :ssn=>"123456790"},
# >> {:record_id=>"2",
# >> :person_first_nm=>"JANE",
# >> :person_last_nm=>"DOE",
# >> :ssn=>"222222222"}]
While it's possible to do this with XPath, Nokogiri's CSS selectors tend to be easier to read, which makes the code easier to maintain, and that is a very good thing.
You'll see the use of | in 'irs|SSN' which is Nokogiri's way of defining a namespace for CSS. This is documented in "Namespaces".
First of all, the XML validator reports this error:
The default (no prefix) Namespace URI for XPath queries is always '' and it cannot be redefined to 'urn:us:gov:treasury:irs:ext:aca:air:7.0'.
so you must set the default xmlns to "".
You can use this code.
require 'nokogiri'
doc = Nokogiri::XML(open('1094C_Request.xml'))
doc.namespaces['xmlns'] = ''
details = doc.xpath("//:Form1095CUpstreamDetail")
elem_a = ["PersonFirstNm", "PersonLastNm", "irs:SSN"]
output = details.each_with_object({}) do |element, exp|
exp[element.xpath("./:RecordId").text] = elem_a.each_with_object({}) do |elem_n, exp_h|
exp_h[elem_n] = element.xpath(".//#{elem_n.include?(':') ? elem_n : ":#{elem_n}"}").text
end
end
output
p output
# {
# "1" => {"PersonFirstNm" => "JOHN", "PersonLastNm" => "Doe", "irs:SSN" => "123456790"},
# "2" => {"PersonFirstNm" => "JANE", "PersonLastNm" => "DOE", "irs:SSN" => "222222222"}
# }
I hope this helps

Generating KML files with Ruby

I'm using the ruby_kml gem right now to try to generate KML from some data in my model.
I also tried georuby.
Both of them, when they generate XML it seems to be coming back escaped like this:
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<kml xmlns=\"http://earth.google.com/kml/2.1\">\n <Folder>\n <name>San Francisco</name>\n <LineStyle>\n <color>#0D7215</color>\n </LineStyle>\n <Placemark>\n <name>21 Google Bus</name>\n <description>\n <![CDATA[Click to add description.]]>\n </description>\n <LineString>\n <coordinates>37.784282779035216,-122.42228507995605 37.784144999999995,-122.42225699999999,37.784084,-122.42274499999999,37.785472,-122.423023,37.785391,-122.423564,37.785364,-122.423839,37.785418,-122.424714,37.785410999999996,-122.42497999999999,37.785391,-122.42522,37.784839,-122.42956,37.784631,-122.431297,37.782576,-122.43086799999999,37.776969,-122.42975399999999,37.776759999999996,-122.431384,37.776368,-122.431305 37.776368,-122.431305,37.777699999999996,-122.431575,37.778746999999996,-122.42335399999999,37.773609,-122.42231199999999,37.773013999999996,-122.42222799999999,37.772974999999995,-122.42222799999999,37.772915,-122.42226799999999,37.772774,-122.422446,37.772636999999996,-122.422585,37.772562,-122.42263399999999,37.772521999999995,-122.422643,37.771588,-122.42253799999999,37.771631,-122.421759</coordinates>\n </LineString>\n </Placemark>\n <LineStyle>\n <color>#0071CA</color>\n </LineStyle>\n <Placemark>\n <name>45 Inverter</name>\n <description>\n <![CDATA[Click to add description.]]>\n </description>\n <LineString>\n <coordinates>37.792490234462946,-122.40863800048828 37.792516,-122.408429,37.793068,-122.408541,37.792957,-122.409357,37.792051,-122.409189,37.788289999999996,-122.40841499999999,37.785495,-122.407866,37.785713,-122.406229,37.785713,-122.40591599999999,37.785699,-122.40576999999999,37.785658,-122.40568999999999,37.783249999999995,-122.40270699999999,37.778850999999996,-122.40827499999999,37.779104,-122.408577</coordinates>\n </LineString>\n </Placemark>\n <LineStyle>\n <color>#AD0101</color>\n </LineStyle>\n <Placemark>\n <name>82 X Wing</name>\n <description>\n <![CDATA[Click to add description.]]>\n </description>\n 
<LineString>\n <coordinates></coordinates>\n </LineString>\n </Placemark>\n <LineStyle>\n <color>#AD0101</color>\n </LineStyle>\n <Placemark>\n <name>93 X Wing</name>\n <description>\n <![CDATA[Click to add description.]]>\n </description>\n <LineString>\n <coordinates></coordinates>\n </LineString>\n </Placemark>\n </Folder>\n</kml>\n"
I'm not sure why it's coming out escaped, since that is definitely not valid XML.
georuby does the same.
Does anyone know why it's coming out escaped and also how to unescape it?
Here's the code I'm using:
map = self;
kml = KMLFile.new
folder = KML::Folder.new(:name => map[:name])
map.lines.each do |line|
folder.features << KML::LineStyle.new(
color: line.color,
)
folder.features << KML::Placemark.new(
:name => line.name,
:geometry => KML::LineString.new(:coordinates => line.coordinates),
:description => line.description
)
end
kml.objects << folder
kml.render
Thanks!!!

SimpleXML Reading node with a hyphenated name

I have the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<gnm:Workbook xmlns:gnm="http://www.gnumeric.org/v10.dtd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gnumeric.org/v9.xsd">
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:ooo="http://openoffice.org/2004/office" office:version="1.1">
<office:meta>
<dc:creator>Mark Baker</dc:creator>
<dc:date>2010-09-01T22:49:33Z</dc:date>
<meta:creation-date>2010-09-01T22:48:39Z</meta:creation-date>
<meta:editing-cycles>4</meta:editing-cycles>
<meta:editing-duration>PT00H04M20S</meta:editing-duration>
<meta:generator>OpenOffice.org/3.1$Win32 OpenOffice.org_project/310m11$Build-9399</meta:generator>
</office:meta>
</office:document-meta>
</gnm:Workbook>
And I am trying to read the office:document-meta node to extract the various elements below it (dc:creator, meta:creation-date, etc.).
The following code:
$xml = simplexml_load_string($gFileData);
$namespacesMeta = $xml->getNamespaces(true);
$officeXML = $xml->children($namespacesMeta['office']);
var_dump($officeXML);
echo '<hr />';
gives me:
object(SimpleXMLElement)[91]
public 'document-meta' =>
object(SimpleXMLElement)[93]
public '@attributes' =>
array
'version' => string '1.1' (length=3)
public 'meta' =>
object(SimpleXMLElement)[94]
but if I try to read the document-meta element using:
$xml = simplexml_load_string($gFileData);
$namespacesMeta = $xml->getNamespaces(true);
$officeXML = $xml->children($namespacesMeta['office']);
$docMeta = $officeXML->document-meta;
var_dump($docMeta);
echo '<hr />';
I get
Notice: Use of undefined constant meta - assumed 'meta' in /usr/local/apache/htdocsNewDev/PHPExcel/Classes/PHPExcel/Reader/Gnumeric.php on line 273
int 0
I assume that SimpleXML is trying to extract a non-existent node "document" from $officeXML, then subtract the value of (non-existent) constant "meta", resulting in forcing the integer 0 result rather than the document-meta node.
Is there a way to resolve this using SimpleXML, or will I be forced to rewrite using XMLReader? Any help appreciated.
Your assumption is correct. Use
$officeXML->{'document-meta'}
to make it work.
Please note that the above applies to Element nodes. Attribute nodes (those within the @attributes property when dumping the SimpleXmlElement) do not require any special syntax to be accessed when hyphenated. They are regularly accessible via array notation, e.g.
$xml = <<< XML
<root>
<hyphenated-element hyphenated-attribute="bar">foo</hyphenated-element>
</root>
XML;
$root = new SimpleXMLElement($xml);
echo $root->{'hyphenated-element'}; // prints "foo"
echo $root->{'hyphenated-element'}['hyphenated-attribute']; // prints "bar"
See the SimpleXml Basics in the Manual for further examples.
I assume the best way to do it is to cast to array:
Consider the following XML:
<subscribe hello-world="yolo">
<callback-url>example url</callback-url>
</subscribe>
You can access members, including attributes, using a cast:
<?php
$xml = (array) simplexml_load_string($input);
$callback = $xml["callback-url"];
$attribute = $xml['@attributes']['hello-world'];
It makes everything easier. Hope I helped.

Resources