SimpleXML Reading node with a hyphenated name - simplexml

I have the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<gnm:Workbook xmlns:gnm="http://www.gnumeric.org/v10.dtd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gnumeric.org/v9.xsd">
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:ooo="http://openoffice.org/2004/office" office:version="1.1">
<office:meta>
<dc:creator>Mark Baker</dc:creator>
<dc:date>2010-09-01T22:49:33Z</dc:date>
<meta:creation-date>2010-09-01T22:48:39Z</meta:creation-date>
<meta:editing-cycles>4</meta:editing-cycles>
<meta:editing-duration>PT00H04M20S</meta:editing-duration>
<meta:generator>OpenOffice.org/3.1$Win32 OpenOffice.org_project/310m11$Build-9399</meta:generator>
</office:meta>
</office:document-meta>
</gnm:Workbook>
and am trying to read the office:document-meta node to extract the various elements below it (dc:creator, meta:creation-date, etc.)
The following code:
$xml = simplexml_load_string($gFileData);
$namespacesMeta = $xml->getNamespaces(true);
$officeXML = $xml->children($namespacesMeta['office']);
var_dump($officeXML);
echo '<hr />';
gives me:
object(SimpleXMLElement)[91]
public 'document-meta' =>
object(SimpleXMLElement)[93]
public '@attributes' =>
array
'version' => string '1.1' (length=3)
public 'meta' =>
object(SimpleXMLElement)[94]
but if I try to read the document-meta element using:
$xml = simplexml_load_string($gFileData);
$namespacesMeta = $xml->getNamespaces(true);
$officeXML = $xml->children($namespacesMeta['office']);
$docMeta = $officeXML->document-meta;
var_dump($docMeta);
echo '<hr />';
I get
Notice: Use of undefined constant meta - assumed 'meta' in /usr/local/apache/htdocsNewDev/PHPExcel/Classes/PHPExcel/Reader/Gnumeric.php on line 273
int 0
I assume that SimpleXML is trying to extract a non-existent node "document" from $officeXML, then subtract the value of (non-existent) constant "meta", resulting in forcing the integer 0 result rather than the document-meta node.
Is there a way to resolve this using SimpleXML, or will I be forced to rewrite using XMLReader? Any help appreciated.

Your assumption is correct. Use
$officeXML->{'document-meta'}
to make it work.
Please note that the above applies to element nodes. Attribute nodes (those within the @attributes property when dumping the SimpleXMLElement) do not require any special syntax when hyphenated. They are accessible as usual via array notation, e.g.
$xml = <<< XML
<root>
<hyphenated-element hyphenated-attribute="bar">foo</hyphenated-element>
</root>
XML;
$root = new SimpleXMLElement($xml);
echo $root->{'hyphenated-element'}; // prints "foo"
echo $root->{'hyphenated-element'}['hyphenated-attribute']; // prints "bar"
See the SimpleXml Basics in the Manual for further examples.

I think the easiest way to do it is to cast to an array.
Consider the following XML:
<subscribe hello-world="yolo">
<callback-url>example url</callback-url>
</subscribe>
You can access members, including attributes, using a cast:
<?php
$xml = (array) simplexml_load_string($input);
$callback = $xml["callback-url"];
$attribute = $xml['@attributes']['hello-world'];
It makes everything easier. Hope I helped.

Related

How to add a comment before XML root node, in Nokogiri?

This is what I'm doing:
xml = Nokogiri::XML('<hello/>')
xml.root.add_previous_sibling(
Nokogiri::XML::Comment.new(
xml, '<!-- how are you? -->'
)
)
This is what I'm trying to achieve:
<?xml version="1.0"?>
<!-- how are you? -->
<hello/>
I'm getting:
ArgumentError: A document may not have multiple root nodes.
What is the right way?
The comment should be added to the xml.children NodeSet.
Here is an example:
xml = Nokogiri::XML('<hello/>')
=> #<Nokogiri::XML::Document:0x3fe1db8d0ed0 name="document" children=[#<Nokogiri::XML::Element:0x3fe1db8d0584 name="hello">]>
xml.children.before(Nokogiri::XML::Comment.new(xml, 'how are you?'))
=> #<Nokogiri::XML::Element:0x3fe1db8d0584 name="hello">
xml.to_s
=> "<?xml version=\"1.0\"?>\n<!--how are you?-->\n<hello/>\n"

How to pull data from tags based on other tags

I have the following example document:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>Doe</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>DOE</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>222222222</irs:SSN>
</EmployeeInfoGrp>
</Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>
Using Nokogiri I want to extract the value between the <PersonFirstNm>, <PersonLastNm> and <irs:SSN> for each <Form1095CUpstreamDetail> based on the <RecordId>.
I tried removing namespaces as well. I posted a small snippet, but I have tried many iterations of working through the XML with no success. This is my first time using XML, so I realize I am likely missing something easy.
When I set my XPath:
require 'nokogiri'
submission_doc = Nokogiri::XML(open('1094C_Request.xml'))
submissions = submission_doc.remove_namespaces
nodes = submission.xpath('//Form1095CUpstreamDetail')
I do not seem to have any association between the RecordId and the tags mentioned above, and I am stuck on where to go next.
The fields are not listed as children for the RecordId, so I can't think of how to approach obtaining their values. I am including the full document as an example to make sure I am not excluding anything.
I have an array of values, and I would like to pull the three tags mentioned above if the RecordId is contained within the array of numbers.
Nokogiri makes it pretty easy to do what you want (assuming the XML is syntactically correct). I'd do something like:
require 'nokogiri'
require 'pp'
doc = Nokogiri::XML(<<EOT)
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonLastNm>Doe</PersonLastNm>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonLastNm>DOE</PersonLastNm>
<irs:SSN>222222222</irs:SSN>
</Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>
EOT
info = doc.search('Form1095CUpstreamDetail').map{ |form|
{
record_id: form.at('RecordId').text,
person_first_nm: form.at('PersonFirstNm').text,
person_last_nm: form.at('PersonLastNm').text,
ssn: form.at('irs|SSN').text
}
}
pp info
# >> [{:record_id=>"1",
# >> :person_first_nm=>"JOHN",
# >> :person_last_nm=>"Doe",
# >> :ssn=>"123456790"},
# >> {:record_id=>"2",
# >> :person_first_nm=>"JANE",
# >> :person_last_nm=>"DOE",
# >> :ssn=>"222222222"}]
While it's possible to do this with XPath, Nokogiri's CSS selectors tend to be easier to read, which makes the code easier to maintain.
You'll see the use of | in 'irs|SSN', which is Nokogiri's way of specifying a namespace in a CSS selector. This is documented in "Namespaces".
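The question also asks to keep only the records whose RecordId appears in a given list, which the code above does not do. A minimal sketch of that last step, assuming the `info` array has the shape printed above; `wanted_ids` is a hypothetical name for the asker's list of numbers:

```ruby
require 'set'

# Same shape as the `info` array printed above (values copied from that output).
info = [
  { record_id: '1', person_first_nm: 'JOHN', person_last_nm: 'Doe', ssn: '123456790' },
  { record_id: '2', person_first_nm: 'JANE', person_last_nm: 'DOE', ssn: '222222222' }
]

# Hypothetical list of record IDs to keep. RecordId text is a String,
# so numeric IDs are converted to strings before comparing.
wanted_ids = [2].map(&:to_s).to_set

# Keep only the records whose id is in the wanted set.
selected = info.select { |record| wanted_ids.include?(record[:record_id]) }

selected.each do |record|
  puts "#{record[:record_id]}: #{record[:person_first_nm]} #{record[:person_last_nm]} #{record[:ssn]}"
end
```

A Set keeps the membership check cheap when the list of IDs is long; for a handful of IDs a plain Array would do just as well.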
First of all, the XML validator reports this error:
The default (no prefix) Namespace URI for XPath queries is always '' and it cannot be redefined to 'urn:us:gov:treasury:irs:ext:aca:air:7.0'.
so you must set the default xmlns to "".
You can use this code:
require 'nokogiri'
doc = Nokogiri::XML(open('1094C_Request.xml'))
doc.namespaces['xmlns'] = ''
details = doc.xpath("//:Form1095CUpstreamDetail")
elem_a = ["PersonFirstNm", "PersonLastNm", "irs:SSN"]
output = details.each_with_object({}) do |element, exp|
exp[element.xpath("./:RecordId").text] = elem_a.each_with_object({}) do |elem_n, exp_h|
exp_h[elem_n] = element.xpath(".//#{elem_n.include?(':') ? elem_n : ":#{elem_n}"}").text
end
end
output
p output
# {
# "1" => {"PersonFirstNm" => "JOHN", "PersonLastNm" => "Doe", "irs:SSN" => "123456790"},
# "2" => {"PersonFirstNm" => "JANE", "PersonLastNm" => "DOE", "irs:SSN" => "222222222"}
# }
I hope this helps

Using a CSV file to insert values using Ruby

I have some sample code I can execute for our Nexpose server and I need to do some mass asset tagging. Here is an example of the code.
nsc = Nexpose::Connection.new('your_nexpose_instance', 'username', 'password', 3780)
nsc.login
criterion = Nexpose::Tag::Criterion.new('IP_RANGE', 'IN', ['ip1', 'ip2'])
criteria = Nexpose::Tag::Criteria.new(criterion)
tag = Nexpose::Tag.new("tagname", Nexpose::Tag::Type::Generic::CUSTOM)
tag.search_criteria = criteria
tag.save(nsc)
I have a CSV file with the following data.
ip1,ip2,tagname
192.168.1.1,192.168.1.255,Workstations
How would I go about running a for loop and using the CSV to quickly process the above code? I have no experience with Ruby and tried to follow some examples, but I'm confused at this point.
There's a CSV library in Ruby's standard lib collection that you can use.
Basic example based on your code example and data, not tested:
require 'csv'
nsc = Nexpose::Connection.new('your_nexpose_instance', 'username', 'password', 3780)
nsc.login
CSV.foreach("path/to/file.csv", headers: true) do |row|
criterion = Nexpose::Tag::Criterion.new('IP_RANGE', 'IN', [row['ip1'], row['ip2']])
criteria = Nexpose::Tag::Criteria.new(criterion)
tag = Nexpose::Tag.new(row['tagname'], Nexpose::Tag::Type::Generic::CUSTOM)
tag.search_criteria = criteria
tag.save(nsc)
end
I made a directory with input.csv and main.rb
input.csv
ip1,ip2,tagname
192.168.1.1,192.168.1.255,Workstations
main.rb
require "csv"
CSV.foreach("input.csv", headers: true) do |row|
puts "ip1: #{row['ip1']}"
puts "ip2: #{row['ip2']}"
puts "tagname: #{row['tagname']}"
end
the output is
ip1: 192.168.1.1
ip2: 192.168.1.255
tagname: Workstations
I hope this can help. If you have questions I'm here :)
If you just need to loop through each line of the file and fire that chunk of code for each line, you could do something like this:
require 'net/http'

# Log in once, outside the loop, rather than once per row.
nsc = Nexpose::Connection.new('your_nexpose_instance', 'username', 'password', 3780)
nsc.login

file = Net::HTTP.get(URI(<whatever_your_file_name_is>))
file.each_line.with_index do |line, index|
  next if index == 0 # skip the header record
  # chomp removes the trailing newline so it does not end up in the tag name
  ip1, ip2, tagname = line.chomp.split(',')
  criterion = Nexpose::Tag::Criterion.new('IP_RANGE', 'IN', [ip1, ip2])
  criteria = Nexpose::Tag::Criteria.new(criterion)
  tag = Nexpose::Tag.new(tagname, Nexpose::Tag::Type::Generic::CUSTOM)
  tag.search_criteria = criteria
  tag.save(nsc)
end
NOTE: This code example is assuming that the CSV file is stored remotely, not locally.
ALSO: In case you're wondering, the next if index == 0 is there to skip your header record.
UPDATE
To use this approach for a local file, you can use File.read() instead of Net::HTTP.get(), like so:
file = File.read(<whatever_your_file_name_is>)
Two things to note:
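Make sure the path resolves from wherever the script runs: a bare filename.csv is looked up relative to the current working directory. Ruby does not expand ~ on its own, so wrap such a path in File.expand_path, e.g. File.expand_path('~/folder/folder/filename.csv').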
If the files you're going to be loading are enormous, this might not be an ideal approach because it's actually reading the whole file into memory. But considering your file only has 3 columns, you'd have to have an extreme number of rows in the file for this to be an issue.
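If memory does become a concern, the file can be read one line at a time instead. A minimal sketch, assuming a local file with the same three columns; `each_tag_row` is a hypothetical helper name, and the Nexpose calls are left out so the snippet runs on its own:

```ruby
require 'tempfile'

# Hypothetical helper: yields [ip1, ip2, tagname] for each data row,
# reading one line at a time instead of loading the whole file.
def each_tag_row(path)
  return enum_for(:each_tag_row, path) unless block_given?

  File.foreach(path).with_index do |line, index|
    next if index.zero?            # skip the header record
    fields = line.chomp.split(',') # chomp drops the trailing newline
    next if fields.empty?          # ignore blank lines
    yield fields
  end
end

# Demo with a throwaway file containing the sample data from the question.
sample = Tempfile.new(['tags', '.csv'])
sample.write("ip1,ip2,tagname\n192.168.1.1,192.168.1.255,Workstations\n")
sample.close

rows = each_tag_row(sample.path).to_a
rows.each { |ip1, ip2, tagname| puts "#{tagname}: #{ip1} - #{ip2}" }
```

CSV.foreach from the standard library also reads row by row and additionally handles quoted fields, so it is usually the better choice when the data is real CSV.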

Unable to findnodes() restricted just to current parent

I'm parsing a simple XML file to create a flat text file from it. The desired outcome is shown below the sample XML. The XML has sort of a header-detail structure (Assembly_Info and Part respectively), with a unique header node followed by any number of detail record nodes, all of which are siblings. After digging into the elements under the header, I can't then find a way back 'up' to then pick up all the sibling detail nodes.
XML file looks like this:
<?xml version="1.0" standalone="yes" ?>
<Wrapper>
<Record>
<Product>
<prodid>4094</prodid>
</Product>
<Assembly>
<Assembly_Info>
<id>DF-7A</id>
<interface>C</interface>
</Assembly_Info>
<Part>
<status>N/A</status>
<dev_name>0000</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>0455</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>045A</dev_name>
</Part>
</Assembly>
<Assembly>
<Assembly_Info>
<id>DF-7A</id>
<interface>C</interface>
</Assembly_Info>
<Part>
<status>N/A</status>
<dev_name>0002</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>0457</dev_name>
</Part>
</Assembly>
</Record>
</Wrapper>
For each Assembly I need to read the values of the two elements in Assembly_Info, which I do successfully. But I then want to read each of the Part records that are associated with the Assembly. The objective is to 'flatten' the file into this:
prodid id interface status dev_name
4094 DF-7A C N/A 0000
4094 DF-7A C Ready 0455
4094 DF-7A C Ready 045A
4094 DF-7A C N/A 0002
4094 DF-7A C Ready 0457
I'm attempting to use findnodes() to do this, as that's about the only tool I thought I understood. My code unfortunately reads all of the Part records from the entire file for each Assembly, since the only way I've been able to find the Part nodes is to start at the root. I don't know how to change 'where I am', if you will, to tell findnodes to begin at the current parent. Code looks like this:
my $parser = XML::LibXML -> new();
my $tree = $parser -> parse_file ('DEMO.XML');
for my $product ($tree->findnodes ('/Wrapper/Record/Product/prodid')) {
$prodid = $product->textContent();
}
foreach my $assembly ($tree->findnodes ('/Wrapper/Record/Assembly')){
$assemblies++;
$parts = 0;
for my $assembly ($tree->findnodes ('/Wrapper/Record/Assembly/Assembly_Info')) {
$id = $assembly->findvalue('id');
$interface = $assembly->findvalue('interface');
}
foreach my $part ($tree->findnodes ('/Wrapper/Record/Assembly/Part')) {
$parts++;
$status = $part->findvalue('status');
$dev_name = $part->findvalue('dev_name');
}
print "Assembly No: ", $assemblies, " Parts: ",$parts, "\n";
}
How do I get just the Part nodes for a given Assembly, after I've gone down to the Assembly_Info depths? There is quite a bit I'm not getting, and I think a problem may be that I'm thinking of this as 'navigating' or moving a cursor, if you will. Examples of XPath path expressions have not helped me.
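Instead of always using $tree as the starting point for the findnodes method, you can call it on any other node, including the nodes returned by an earlier findnodes call. A relative XPath expression is then evaluated against that node only. For example:
for my $record ($tree->findnodes('/Wrapper/Record')) {
    my $prodid = $record->findvalue('./Product/prodid');
    for my $assembly ($record->findnodes('./Assembly')) {
        my $id        = $assembly->findvalue('./Assembly_Info/id');
        my $interface = $assembly->findvalue('./Assembly_Info/interface');
        # './Part' only matches Part children of the current Assembly
        for my $part ($assembly->findnodes('./Part')) {
            print join(' ', $prodid, $id, $interface,
                $part->findvalue('./status'), $part->findvalue('./dev_name')), "\n";
        }
    }
}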

REXML parsing an XML in ruby

Folks,
I am using REXML for a sample XML file:
<Accounts title="This is the test title">
<Account name="frenchcustomer">
<username name = "frencu"/>
<password pw = "hello34"/>
<accountdn dn = "https://frenchcu.com/"/>
<exporttest name="basic">
<exportname name = "basicexport"/>
<exportterm term = "oldschool"/>
</exporttest>
</Account>
<Account name="britishcustomer">
<username name = "britishcu"/>
<password pw = "mellow34"/>
<accountdn dn = "https://britishcu.com/"/>
<exporttest name="existingsearch">
<exportname name = "largexpo"/>
<exportterm term = "greatschool"/>
</exporttest>
</Account>
</Accounts>
I am reading the XML like this:
@data = (REXML::Document.new file).root
@dataarr = @@testdata.elements.to_a("//Account")
Now I want to get the username of the frenchcustomer, so I tried this:
@dataarr[@name=fenchcustomer].elements["username"].attributes["name"]
This fails. I do not want to use the array index; for example,
@dataarr[1].elements["username"].attributes["name"]
will work, but I don't want to do that. Is there something I'm missing here? I want to use the array and get the username of the French user by the Account name.
Thanks a lot.
I recommend using XPath.
For the first match, use the first method; for an array of all matches, use match.
The code below returns the username for the Account "frenchcustomer":
REXML::XPath.first(yourREXMLDocument, "//Account[@name='frenchcustomer']/username/@name").value
If you really want to use the array created with @@testdata.elements.to_a("//Account"), you could use the find method:
french_cust_elt = the_array.find { |elt| elt.attributes['name'].eql?('frenchcustomer') }
french_username = french_cust_elt.elements["username"].attributes["name"]
puts @data.elements["//Account[@name='frenchcustomer']"]
.elements["username"]
.attributes["name"]
If you want to iterate over multiple identical names:
@data.elements.each("//Account[@name='frenchcustomer']") do |fc|
puts fc.elements["username"].attributes["name"]
end
I don't know what your @@testdata are, so I tried the following test code:
require "rexml/document"
@data = (REXML::Document.new DATA).root
@dataarr = @data.elements.to_a("//Account")
# Works
p @dataarr[1].elements["username"].attributes["name"]
# Does not work
#~ p @dataarr[@name='fenchcustomer'].elements["username"].attributes["name"]
# @dataarr is an array
@dataarr.each{|acc|
next unless acc.attributes['name'] =='frenchcustomer'
p acc.elements["username"].attributes["name"]
}
# @dataarr is an array
puts "===Array#each"
@dataarr.each{|acc|
next unless acc.attributes['name'] =='frenchcustomer'
p acc.elements["username"].attributes["name"]
}
puts "===XPATH"
@data.elements.to_a("//Account[@name='frenchcustomer']").each{|acc|
p acc.elements["username"].attributes["name"]
}
__END__
<Accounts title="This is the test title">
<Account name="frenchcustomer">
<username name = "frencu"/>
<password pw = "hello34"/>
<accountdn dn = "https://frenchcu.com/"/>
<exporttest name="basic">
<exportname name = "basicexport"/>
<exportterm term = "oldschool"/>
</exporttest>
</Account>
<Account name="britishcustomer">
<username name = "britishcu"/>
<password pw = "mellow34"/>
<accountdn dn = "https://britishcu.com/"/>
<exporttest name="existingsearch">
<exportname name = "largexpo"/>
<exportterm term = "greatschool"/>
</exporttest>
</Account>
</Accounts>
I'm not very familiar with REXML, so I expect there is a better solution. But perhaps somebody can take my code and build a better one.
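If accounts are looked up by name more than once, one option is to index the array once and then fetch by key. A minimal sketch, assuming a trimmed copy of the XML from the question; `accounts_by_name` is a hypothetical name, and this is a plain Hash lookup rather than anything specific to REXML:

```ruby
require 'rexml/document'

# Trimmed copy of the XML from the question.
xml = <<~XML
  <Accounts title="This is the test title">
    <Account name="frenchcustomer">
      <username name="frencu"/>
    </Account>
    <Account name="britishcustomer">
      <username name="britishcu"/>
    </Account>
  </Accounts>
XML

root = REXML::Document.new(xml).root

# Index every Account element by its name attribute (built once).
accounts_by_name = root.elements.to_a('//Account').to_h do |account|
  [account.attributes['name'], account]
end

# fetch raises KeyError for an unknown account name instead of returning nil.
french = accounts_by_name.fetch('frenchcustomer')
french_username = french.elements['username'].attributes['name']
puts french_username # prints "frencu"
```

If two accounts share a name, the later one overwrites the earlier one in the Hash, so the XPath iteration shown earlier remains the safer choice for duplicate names.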
