Why do I get errors with XML modified using Nokogiri? - ruby

I am having problems understanding Net::HTTP and Nokogiri.
I have a large number of jobs on my Jenkins server. I have to periodically update the branch name on these jobs. Doing it from the UI is a cumbersome process so I decided to update the Jenkins config.xml.
I use Nokogiri to parse the XML, traverse the XPath and update the value of the node. However, when I try to post the updated XML back to Jenkins, I get a 500 error saying:
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseExceptionpublicId: -//W3C//DTD HTML 4.0 Transitional//EN; systemId: http://www.w3.org/TR/REC-html40/loose.dtd; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'.
Here is what I am doing:
require "net/http"
require "nokogiri"
uri = URI.parse("http://jenkins.my.domain.web:8080")
http = Net::HTTP.new(uri.host, uri.port)
getQueueRequest = Net::HTTP::Get.new("http://jenkins.my.domain.web:8080/my/job/location/config.xml")
getQueue = http.request(getQueueRequest)
xml_doc = Nokogiri::HTML(getQueue.body)
# Get current branch name
branch_name=xml_doc.at_xpath('//hudson.plugins.git.branchspec/name')
# Get new branch name
print "Enter new branch name "
user_input = gets.chomp
new_branch_name = user_input.downcase
# Set branch name and create xml
branch_name.content=new_branch_name
new_config_xml=xml_doc.to_xml
puts "Logging into Jenkins"
update_branch = Net::HTTP::Post.new("http://jenkins.my.domain.web:8080/my/job/location/config.xml")
update_branch.basic_auth 'username', 'password'
update_branch.body = new_config_xml
response = http.request(update_branch)
puts response.body
I understand it might have something to do with the XML that is getting added to the request body, but I am not sure how to fix the issue.
Original XML:
<?xml version='1.0' encoding='UTF-8'?>
<maven2-moduleset plugin="maven-plugin#1.504">
<actions/>
<description></description>
<keepDependencies>false</keepDependencies>
<properties>
<hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents#1.7.2">
<maxConcurrentPerNode>0</maxConcurrentPerNode>
<maxConcurrentTotal>0</maxConcurrentTotal>
<categories/>
<throttleEnabled>false</throttleEnabled>
<throttleOption>project</throttleOption>
<configVersion>1</configVersion>
</hudson.plugins.throttleconcurrents.ThrottleJobProperty>
</properties>
<scm class="hudson.plugins.git.GitSCM" plugin="git#1.4.0">
<configVersion>2</configVersion>
<userRemoteConfigs>
<hudson.plugins.git.UserRemoteConfig>
<name></name>
<refspec></refspec>
<url>git#github.com:<ORG_NAME>/<REPO_NAME>.git</url>
</hudson.plugins.git.UserRemoteConfig>
</userRemoteConfigs>
<branches>
<hudson.plugins.git.BranchSpec>
<name>release</name>
</hudson.plugins.git.BranchSpec>
</branches>
<disableSubmodules>false</disableSubmodules>
<recursiveSubmodules>false</recursiveSubmodules>
<doGenerateSubmoduleConfigurations>false</doGenerateSubmoduleConfigurations>
<authorOrCommitter>false</authorOrCommitter>
<clean>false</clean>
<wipeOutWorkspace>false</wipeOutWorkspace>
<pruneBranches>false</pruneBranches>
<remotePoll>false</remotePoll>
<ignoreNotifyCommit>false</ignoreNotifyCommit>
<useShallowClone>false</useShallowClone>
<buildChooser class="hudson.plugins.git.util.DefaultBuildChooser"/>
<gitTool>Default</gitTool>
<submoduleCfg class="list"/>
<relativeTargetDir></relativeTargetDir>
<reference></reference>
<excludedRegions></excludedRegions>
<excludedUsers></excludedUsers>
<gitConfigName></gitConfigName>
<gitConfigEmail></gitConfigEmail>
<skipTag>false</skipTag>
<includedRegions></includedRegions>
<scmName></scmName>
</scm>
<canRoam>true</canRoam>
<disabled>false</disabled>
<blockBuildWhenDownstreamBuilding>false</blockBuildWhenDownstreamBuilding>
<blockBuildWhenUpstreamBuilding>false</blockBuildWhenUpstreamBuilding>
<triggers class="vector">
<hudson.triggers.TimerTrigger>
<spec>0 22 * * 4</spec>
</hudson.triggers.TimerTrigger>
</triggers>
<concurrentBuild>false</concurrentBuild>
<rootModule>
<groupId>com.org.project.test</groupId>
<artifactId>functest</artifactId>
</rootModule>
<goals>clean verify -Dtestsuite=<test_suite_name> -Dbrowser=chrome -Dipaddress=http://<IP_ADDRESS>:4444/wd/hub</goals>
<mavenName>apache-maven-3.0.4</mavenName>
<aggregatorStyleBuild>true</aggregatorStyleBuild>
<incrementalBuild>false</incrementalBuild>
<perModuleEmail>true</perModuleEmail>
<ignoreUpstremChanges>false</ignoreUpstremChanges>
<archivingDisabled>false</archivingDisabled>
<resolveDependencies>false</resolveDependencies>
<processPlugins>false</processPlugins>
<mavenValidationLevel>-1</mavenValidationLevel>
<runHeadless>false</runHeadless>
<disableTriggerDownstreamProjects>false</disableTriggerDownstreamProjects>
<settings class="jenkins.mvn.DefaultSettingsProvider"/>
<globalSettings class="jenkins.mvn.DefaultGlobalSettingsProvider"/>
<reporters/>
<publishers/>
<buildWrappers/>
<prebuilders/>
<postbuilders/>
<runPostStepsIfResult>
<name>FAILURE</name>
<ordinal>2</ordinal>
<color>RED</color>
</runPostStepsIfResult>
</maven2-moduleset>
After Editing and Massaging:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
<maven2-moduleset plugin="maven-plugin#1.504">
<actions />
<description />
<keepdependencies>false</keepdependencies>
<properties>
<hudson.plugins.throttleconcurrents.throttlejobproperty plugin="throttle-concurrents#1.7.2">
<maxconcurrentpernode>0</maxconcurrentpernode>
<maxconcurrenttotal>0</maxconcurrenttotal>
<categories />
<throttleenabled>false</throttleenabled>
<throttleoption>project</throttleoption>
<configversion>1</configversion>
</hudson.plugins.throttleconcurrents.throttlejobproperty>
</properties>
<scm class="hudson.plugins.git.GitSCM" plugin="git#1.4.0">
<configversion>2</configversion>
<userremoteconfigs>
<hudson.plugins.git.userremoteconfig>
<name />
<refspec />
<url>git#github.com:<ORG_NAME>/<REPO_NAME>.git</url>
</hudson.plugins.git.userremoteconfig>
</userremoteconfigs>
<branches>
<hudson.plugins.git.branchspec>
<name>master</name>
</hudson.plugins.git.branchspec>
</branches>
<disablesubmodules>false</disablesubmodules>
<recursivesubmodules>false</recursivesubmodules>
<dogeneratesubmoduleconfigurations>false</dogeneratesubmoduleconfigurations>
<authororcommitter>false</authororcommitter>
<clean>false</clean>
<wipeoutworkspace>false</wipeoutworkspace>
<prunebranches>false</prunebranches>
<remotepoll>false</remotepoll>
<ignorenotifycommit>false</ignorenotifycommit>
<useshallowclone>false</useshallowclone>
<buildchooser class="hudson.plugins.git.util.DefaultBuildChooser" />
<gittool>Default</gittool>
<submodulecfg class="list" />
<relativetargetdir />
<reference />
<excludedregions />
<excludedusers />
<gitconfigname />
<gitconfigemail />
<skiptag>false</skiptag>
<includedregions />
<scmname />
</scm>
<canroam>true</canroam>
<disabled>false</disabled>
<blockbuildwhendownstreambuilding>false</blockbuildwhendownstreambuilding>
<blockbuildwhenupstreambuilding>false</blockbuildwhenupstreambuilding>
<triggers class="vector">
<hudson.triggers.timertrigger>
<spec>0 22 * * 4</spec>
</hudson.triggers.timertrigger>
</triggers>
<concurrentbuild>false</concurrentbuild>
<rootmodule>
<groupid>com.org.project.test</groupid>
<artifactid>functest</artifactid>
</rootmodule>
<goals>clean verify -Dtestsuite=<test_suite_name> -Dbrowser=chrome -Dipaddress=http://<IP_ADDRESS>:4444/wd/hub</goals>
<mavenname>apache-maven-3.0.4</mavenname>
<aggregatorstylebuild>true</aggregatorstylebuild>
<incrementalbuild>false</incrementalbuild>
<permoduleemail>true</permoduleemail>
<ignoreupstremchanges>false</ignoreupstremchanges>
<archivingdisabled>false</archivingdisabled>
<resolvedependencies>false</resolvedependencies>
<processplugins>false</processplugins>
<mavenvalidationlevel>-1</mavenvalidationlevel>
<runheadless>false</runheadless>
<disabletriggerdownstreamprojects>false</disabletriggerdownstreamprojects>
<settings class="jenkins.mvn.DefaultSettingsProvider" />
<globalsettings class="jenkins.mvn.DefaultGlobalSettingsProvider" />
<reporters />
<publishers />
<buildwrappers />
<prebuilders />
<postbuilders />
<runpoststepsifresult>
<name>FAILURE</name>
<ordinal>2</ordinal>
<color>RED</color>
</runpoststepsifresult>
</maven2-moduleset>
</body>
</html>

When you use Nokogiri::HTML(some_html) or Nokogiri::XML(some_xml), Nokogiri will look to see if the content is valid. If it isn't, it will do fix-ups on the content in an attempt to make it so. For instance:
require 'nokogiri'
html_fragment = "<p>foo bar</p>"
Nokogiri::HTML(html_fragment).to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo bar</p></body></html>\n"
If the document is partially correct Nokogiri still adds the DOCTYPE statement:
html = "<html><body><p>foo bar</p></body></html>"
Nokogiri::HTML(html).to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo bar</p></body></html>\n"
If you want Nokogiri to leave the document alone, because it's supposed to be a fragment, tell it to do so:
Nokogiri::HTML::DocumentFragment.parse(html_fragment).to_html
# => "<p>foo bar</p>"
Or:
xml_fragment = "<x>foo bar</x>"
Nokogiri::XML::DocumentFragment.parse(xml_fragment).to_xml
# => "<x>foo bar</x>"
Nokogiri is pretty smart about handling XML and HTML. You can try to confuse it and it'll generally do the right thing:
xml_fragment = "<x>foo bar</x>"
Nokogiri::HTML::DocumentFragment.parse(xml_fragment).to_xml
# => "<x>foo bar</x>"
That's parsing XML as an HTML fragment and telling it to emit it as XML.
Now, that all said, it's pretty obvious Nokogiri isn't doing anything mysterious, so, here's how to fix the problem. First, parse it as XML so Nokogiri doesn't think it should add the HTML DOCTYPE declaration, then, if the XML is syntactically correct, tell Nokogiri it's OK to parse it as a complete document:
require 'nokogiri'
xml = %{<?xml version='1.0' encoding='UTF-8'?>
<maven2-moduleset plugin="maven-plugin#1.504">
<actions/>
<description></description>
<keepDependencies>false</keepDependencies>
<properties>
<hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents#1.7.2">
</hudson.plugins.throttleconcurrents.ThrottleJobProperty>
</properties>
</maven2-moduleset>
}
puts Nokogiri::XML.parse(xml).to_xml
# >> <?xml version="1.0" encoding="UTF-8"?>
# >> <maven2-moduleset plugin="maven-plugin#1.504">
# >> <actions/>
# >> <description/>
# >> <keepDependencies>false</keepDependencies>
# >> <properties>
# >> <hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents#1.7.2">
# >> </hudson.plugins.throttleconcurrents.ThrottleJobProperty>
# >> </properties>
# >> </maven2-moduleset>
Or as a fragment, which, because it's complete, will result in the same thing:
puts Nokogiri::XML::DocumentFragment.parse(xml).to_xml
# >> <?xml version='1.0' encoding='UTF-8'?>
# >> <maven2-moduleset plugin="maven-plugin#1.504">
# >> <actions/>
# >> <description/>
# >> <keepDependencies>false</keepDependencies>
# >> <properties>
# >> <hudson.plugins.throttleconcurrents.ThrottleJobProperty plugin="throttle-concurrents#1.7.2">
# >> </hudson.plugins.throttleconcurrents.ThrottleJobProperty>
# >> </properties>
# >> </maven2-moduleset>
Instead of using Net::HTTP, which is the bare-building blocks for HTTP, I'd recommend looking at something a bit higher-level, like HTTPClient. Here's code that is similar to yours:
require 'httpclient'
require 'nokogiri'
URL = 'http://jenkins.my.domain.web:8080/my/job/location/config.xml'
http_client = HTTPClient.new
xml_doc = Nokogiri::XML(
http_client.get_content(URL)
)
# Get current branch name. XPath is used here because the dots in the
# element name would be read as class selectors in a CSS query:
branch_name = xml_doc.at_xpath('//hudson.plugins.git.BranchSpec/name')
# Get new branch name
print 'Enter new branch name '
new_branch_name = gets.chomp.downcase
# Set branch name and create xml
branch_name.content = new_branch_name
puts 'Logging into Jenkins'
http_client.set_auth(URL, 'user', 'password')
response = http_client.post(URL, :body => xml_doc.to_xml)
I can't test it but it looks close.
I now find myself in another dilemma. The methods that move to elements and edit values, like at_xpath and at_css, only seem to work with Nokogiri::HTML or Nokogiri::HTML::DocumentFragment; they don't work when I use Nokogiri::XML. Using Nokogiri::HTML changes the case of the tags: <keepDependencies>false</keepDependencies> becomes <keepdependencies>false</keepdependencies>. Jenkins does not accept the XML with the changed tag case. The to_html and to_xml methods return a string, so I cannot use the xpath or css methods on their result to navigate the XML tree. Is there a way around this?
The at methods work with both XML and HTML, and allow both CSS and XPath selectors; everything inside Nokogiri is really XML-based.
Nokogiri folds HTML tags to lower-case because HTML is case-insensitive, so at expects a lower-case value when dealing with HTML. XML is case-sensitive, so Nokogiri leaves the tag case alone, and at requires you to use the correct case when using CSS.
This is documented in the Nokogiri docs:
Note that the CSS query string is case-sensitive with regards to your document type. That is, if you’re looking for “H1” in an HTML document, you’ll never find anything, since HTML tags will match only lowercase CSS queries. However, “H1” might be found in an XML document, where tags names are case-sensitive (e.g., “H1” is distinct from “h1”).

When you are parsing the XML you are receiving from the service, you are declaring it as HTML:
xml_doc = Nokogiri::HTML(getQueue.body)
And this appears to cause Nokogiri to add HTML nodes.
Try parsing it as XML instead:
xml_doc = Nokogiri::XML(getQueue.body)

Related

Nokogiri parsing through XML fails

The code:
response = Nokogiri::XML(open('https://geocode-maps.yandex.ru/1.x/?geocode=%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3+%D0%A1%D0%B2%D0%B5%D1%80%D0%B4%D0%BB%D0%BE%D0%B2%D1%81%D0%BA%D0%B0%D1%8F+%D0%BD%D0%B0%D0%B1%D0%B5%D1%80%D0%B5%D0%B6%D0%BD%D0%B0%D1%8F+44%D0%A2'), nil, Encoding::UTF_8.to_s)
lowerCorner = response.xpath("//lowerCorner")
XML document I parse is like:
<?xml version="1.0" encoding="utf-8"?>
<ymaps xmlns="http://maps.yandex.ru/ymaps/1.x" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation=" http://maps.yandex.ru/business/1.x http://maps.yandex.ru/schemas/business/1.x/business.xsd http://maps.yandex.ru/geocoder/1.x http://maps.yandex.ru/schemas/geocoder/1.x/geocoder.xsd http://maps.yandex.ru/psearch/1.x http://maps.yandex.ru/schemas/psearch/1.x/psearch.xsd http://maps.yandex.ru/search/1.x http://maps.yandex.ru/schemas/search/1.x/search.xsd http://maps.yandex.ru/web/1.x http://maps.yandex.ru/schemas/web/1.x/web.xsd http://maps.yandex.ru/search/internal/1.x http://maps.yandex.ru/schemas/search/internal/1.x/internal.xsd">
<GeoObjectCollection>
<metaDataProperty xmlns="http://www.opengis.net/gml">
<GeocoderResponseMetaData xmlns="http://maps.yandex.ru/geocoder/1.x">
<request>Санкт-Петербург Свердловская набережная 44Т</request>
<found>1</found>
<results>10</results>
</GeocoderResponseMetaData>
</metaDataProperty>
<featureMember xmlns="http://www.opengis.net/gml">
<GeoObject xmlns="http://maps.yandex.ru/ymaps/1.x" xmlns:gml="http://www.opengis.net/gml" gml:id="1">
<metaDataProperty xmlns="http://www.opengis.net/gml">
<GeocoderMetaData xmlns="http://maps.yandex.ru/geocoder/1.x">
<kind>house</kind>
<text>Россия, Санкт-Петербург, Свердловская набережная, 44Т</text>
<precision>exact</precision>
</GeocoderMetaData>
</metaDataProperty>
<boundedBy>
<Envelope>
<lowerCorner>30.397902 59.959183</lowerCorner>
<upperCorner>30.406113 59.9633</upperCorner>
</Envelope>
</boundedBy>
<Point xmlns="http://www.opengis.net/gml">
<pos>30.402008 59.961242</pos>
</Point>
</GeoObject>
</featureMember>
</GeoObjectCollection>
</ymaps>
I'd like to get lowerCorner, but nothing from the official docs or other sources works:
response.xpath('//lowerCorner')
response.search('//lowerCorner')
response.xpath('xmlns:lowerCorner')
response.xpath('xmlns:lowerCorner', ns).text
response.css('lowerCorner')
The only result is: []
So how to parse lowerCorner's content?
Removing the namespaces (or using them in your path) should help.
Try this:
require "nokogiri"
require "open-uri"
response = Nokogiri::XML(URI.open('https://geocode-maps.yandex.ru/1.x/?geocode=%D0%A1%D0%B0%D0%BD%D0%BA%D1%82-%D0%9F%D0%B5%D1%82%D0%B5%D1%80%D0%B1%D1%83%D1%80%D0%B3+%D0%A1%D0%B2%D0%B5%D1%80%D0%B4%D0%BB%D0%BE%D0%B2%D1%81%D0%BA%D0%B0%D1%8F+%D0%BD%D0%B0%D0%B1%D0%B5%D1%80%D0%B5%D0%B6%D0%BD%D0%B0%D1%8F+44%D0%A2'), nil, Encoding::UTF_8.to_s)
response.remove_namespaces! # <<<<<<<
lower_corner = response.xpath("/ymaps/GeoObjectCollection/featureMember/GeoObject/boundedBy/Envelope/lowerCorner").first
p lower_corner.text #> "30.397902 59.959183"

Ruby Savon - Parse XML string

I have exhausted google on this subject and I just can't seem to get it right..
I have the following XML payload returned from Savon:
<?xml version='1.0' encoding='UTF-8'?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<ns:listGFUsersResponse xmlns:ns="http://ws.fds.com">
<ns:return>
<responseCode>0000</responseCode><responseDescription>No Errors-DWI</responseDescription><user><login>aa1283</login><name>Andrew Alonzo</name><team>DIALER</team><secLev>-1</secLev><maxDiscount>0.00</maxDiscount><phoneSystemId></phoneSystemId></user><user><login>aaronc</login><name>Aaron Callison</name><team></team><secLev>-1</secLev><maxDiscount>0.00</maxDiscount><phoneSystemId></phoneSystemId></user>
</ns:return>
</ns:listGFUsersResponse>
</soapenv:Body>
</soapenv:Envelope>
I would like to parse out ALL values of <name> * </name> and <login> * </login>
A few of my attempts here:
response1 = client1.call(
:list_gf_users,
message: message)
doc = Nokogiri::XML(response1.to_s)
pp doc
p doc.search('/name').text
p doc.search('/login').text
Nothing returned...
doc = Nokogiri::XML(response1.to_s)
value = doc.xpath('/name').map(&:text)
puts value
Nada....
doc = Nokogiri::XML(response1.to_s)
value = doc.xpath('/user[name]').map(&:text)
puts value
Zilch...
would love to be able to see:
name: Andrew Alonzo
login: aa1283
or even better a Hash?
{"aa1283" => "Andrew Alonzo"}
Getting 0 results such as:
""
[]
nil
Figured it out... probably not most efficient but gets the job done:
Convert the Savon response to a string (scan can't be used on the Savon output directly):
doc = response1.to_s
subFile = doc.gsub("&lt;", "<") # Replace the escaped characters
Run scan using regex capture groups:
@user = subFile.scan(/<user><login>(.+?)<\/login><name>(.*?)<\/name>.+?><\/user>/)
In your comments you have
doc = response1.doc
which gives you a Nokogiri document. With that you should be able to do the following:
doc.xpath("//user").each do |user|
login = user.at("login")&.text
name = user.at("name")&.text
puts "#{login}: #{name}"
end
The output is
aa1283: Andrew Alonzo
aaronc: Aaron Callison
I used the XML from your comment:
<root>
<responseCode>0000</responseCode>
<responseDescription>No Errors-DWI</responseDescription>
<user>
<login>aa1283</login>
<name>Andrew Alonzo</name>
<team>DIALER</team>
<secLev>-1</secLev>
<maxDiscount>0.00</maxDiscount>
<phoneSystemId></phoneSystemId>
</user>
<user>
<login>aaronc</login>
<name>Aaron Callison</name>
<team></team>
<secLev>-1</secLev>
<maxDiscount>0.00</maxDiscount>
<phoneSystemId></phoneSystemId>
</user>
</root>
Note that I had to convert this to plaintext. You have some non-printing unicode characters sprinkled throughout the document in seemingly random places (which makes me wonder if that's actually the cause of your problems).

How to use Nokogiri to combine multiple like-formatted XML files into CSV

I want to parse multiple like-formatted XML files into a CSV file.
I searched on Google, nokogiri.org, and on SO but I haven't been able to find an answer.
I have ten XML files in identical format in terms of node/element structure, that reside in the current directory.
After combining the XML files into a single XML file, I need to pull out specific elements of the advisory node. I would like to output the link, title, location, os -> language -> name, and reference -> name data to the CSV file.
My code is only able to parse a single XML document and I'd like it to take into account 1:many:
# Parse the XML file into a Nokogiri::XML::Document object
@doc = Nokogiri::XML(File.open("file.xml"))
# Gather the 5 specific XML elements out of the 'advisory' top-level node
data = @doc.search('advisory').map { |adv|
[
adv.at('link').content,
adv.at('title').content,
adv.at('location').content,
adv.at('os > language > name').content,
adv.at('reference > name').content
]
}
# Loop through each array element in the object and write out as CSV row
CSV.open('output_file.csv', 'wb') do |csv|
# Explicitly set headers until you figure out how to get them programmatically
csv << ['Link', 'Title', 'Location', 'OS Name', 'Reference Name']
data.each do |row|
csv << row
end
end
I tried changing the code to support multiple XML files and get them into Nokogiri::XML::Document objects:
xml_docs = []
Dir.glob("*.xml").each do |file|
xml = Nokogiri::XML(File.new(file))
xml_docs << Nokogiri::XML::Document.new(xml)
end
This successfully creates an array xml_docs with the correct objects in it, but I don't know how to convert these six objects into a single object.
This is sample XML. All XML files use the same node/element structure:
<advisories>
<title> Not relevant </title>
<customer> N/A </customer>
<advisory id="12345">
<link> https://www.google.com </link>
<release_date>2016-04-07</release_date>
<title> The Short Description Would Go Here </title>
<location> Location Name Here </location>
<os>
<product>
<id>98765</id>
<name>Product Name</name>
</product>
<language>
<id>123</id>
<name>en</name>
</language>
</os>
<reference>
<id>00029</id>
<name>Full</name>
<area>Not Defined</area>
</reference>
</advisory>
<advisory id="98765">
<link> https://www.msn.com </link>
<release_date>2016-04-08</release_date>
<title> The Short Description Would Go Here </title>
<location> Location Name Here </location>
<os>
<product>
<id>12654</id>
<name>Product Name</name>
</product>
<language>
<id>126</id>
<name>fr</name>
</language>
</os>
<reference>
<id>00052</id>
<name>Partial</name>
<area>Defined</area>
</reference>
</advisory>
</advisories>
The code leverages Nokogiri::XML::Document but if Nokogiri::XML::Builder will work better for this, I am more than willing to adjust my code accordingly.
I'd handle the first part, of parsing one XML file, like this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<advisories>
<advisory id="12345">
<link> https://www.google.com </link>
<title> The Short Description Would Go Here </title>
<location> Location Name Here </location>
<os>
<language>
<name>en</name>
</language>
</os>
<reference>
<name>Full</name>
</reference>
</advisory>
<advisory id="98765">
<link> https://www.msn.com </link>
<release_date>2016-04-08</release_date>
<title> The Short Description Would Go Here </title>
<location> Location Name Here </location>
<os>
<language>
<name>fr</name>
</language>
</os>
<reference>
<name>Partial</name>
</reference>
</advisory>
</advisories>
EOT
Note: This has nodes removed because they weren't important to the question. Please remove fluff when asking as it's distracting.
With this being the core of the code:
doc.search('advisory').map{ |advisory|
link = advisory.at('link').text
title = advisory.at('title').text
location = advisory.at('location').text
os_language_name = advisory.at('os > language > name').text
reference_name = advisory.at('reference > name').text
{
link: link,
title: title,
location: location,
os_language_name: os_language_name,
reference_name: reference_name
}
}
That could be DRY'd but was written as an example of what to do.
Running that results in an array of hashes, which would be easily output via CSV:
# => [
{:link=>" https://www.google.com ", :title=>" The Short Description Would Go Here ", :location=>" Location Name Here ", :os_language_name=>"en", :reference_name=>"Full"},
{:link=>" https://www.msn.com ", :title=>" The Short Description Would Go Here ", :location=>" Location Name Here ", :os_language_name=>"fr", :reference_name=>"Partial"}
]
Once you've got that working then fit it into a modified version of your loops to output CSV and read the XML files. This is untested but looks about right:
CSV.open('output_file.csv', 'w',
headers: ['Link', 'Title', 'Location', 'OS Name', 'Reference Name'],
write_headers: true
) do |csv|
Dir.glob("*.xml").each do |file|
xml = Nokogiri::XML(File.read(file))
# parse a file and get the array of hashes
end
# pass the array of hashes to CSV for output
end
Note that you were using a file mode of 'wb'. You rarely need b with CSV as CSV is supposed to be a text format. If you are sure you will encounter binary data then use 'b' also, but that could lead down a path containing dragons.
Also note that this is using read. read is not scalable, which means it doesn't care how big a file is, it's going to try to read it into memory, whether or not it'll actually fit. There are lots of reasons to avoid that, but the best is that it'll bring your program to its knees. If your XML files could exceed the available free memory for your system then you'll want to rewrite using a SAX parser, which Nokogiri supports. How to do that is a different question.
It was actually an array of arrays of hashes. I'm not sure how I ended up there, but I was easily able to use array.flatten.
Meditate on this:
foo = [] # => []
foo << [{}] # => [[{}]]
foo.flatten # => [{}]
You probably wanted to do this:
foo = [] # => []
foo += [{}] # => [{}]
Any time I have to use flatten I look to see if I can create the array without it being an array of arrays of something. It's not that they're inherently bad, because sometimes they're very useful, but you really wanted an array of hashes so you knew something was wrong and flatten was a cheap way out, but using it also costs more CPU time. It's better to figure out the problem and fix it and end up with faster/more efficient code. (And some will say that's a wasted effort or is premature optimization, but writing efficient code is a very good trait and goal.)
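When the nested array comes from mapping over files, flat_map is another way to avoid building it in the first place. This is a different technique from the += shown above, and rows_for with its fake data is purely a stand-in for "parse one file and return its array of hashes".

```ruby
# Stand-in for "parse one file and return its array of hashes".
# The file names and hash contents here are made up for illustration.
def rows_for(file)
  [{ file: file, id: 1 }, { file: file, id: 2 }]
end

files = ['a.xml', 'b.xml']

# map keeps one inner array per file: an array of arrays of hashes.
nested = files.map { |file| rows_for(file) }

# flat_map concatenates the inner arrays as it goes: an array of hashes.
flat = files.flat_map { |file| rows_for(file) }

puts nested.length
puts flat.length
```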

Testing Nokogiri XML generation with blank nodes

I'm having a bit of trouble testing some XML generation using Nokogiri when the node is blank. I'm using Minitest to compare the generated XML string with a template fixture file. My test fails with the blank node as Minitest is comparing <Node></Node> with <Node />.
XML Generation
builder = Nokogiri::XML::Builder.new encoding: "UTF-8" do |xml|
xml.Header
xml.FileName @object.filename
end
Template file
This is the file I'm using as a fixture in my tests
<?xml version="1.0" encoding="UTF-8"?>
<Header/>
<FileName></FileName>
Minitest output
3) Failure:
--- expected
+++ actual
@@ -25,7 +25,7 @@
<Header />
- <FileName/>
+ <FileName></FileName>
As you can see, MiniTest is trying to compare a self-closing tag with a non-self-closing tag and making the test fail. Changing the fixture tag to a self-closing one results, strangely, in exactly the same error message.
It's because sometimes @object.filename is nil - if I have a blank XML node (as in xml.Header above) using a self-closing tag in my fixture works no problem.
I would use XML schema in this case:
def test_that_xml_data_conforms_to_schema
xml_data = ...
schema_data = ...
fragment = Nokogiri::XML.parse(xml_data)
schema = Nokogiri::XML::Schema(schema_data)
assert schema.valid?(fragment)
end

Nokogiri XML Searching

I've tried reading the Nokogiri docs, etc., but I've come to a roadblock.
I get an XML output similar to
<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns1:getPoliciesResponse xmlns:ns1="http://policy.api.control.r1soft.com/">
<return>
<CDPId>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx</CDPId>
<description/>
<diskSafeID>bcb68765-a719-4291-912d-2e6af485ea24</diskSafeID>
<enabled>true</enabled>
<id>cdb65427-d6f4-4a89-9f77-8763e22dc74b</id>
<lastReplicationRunTime>2013-06-12T13:29:40.105-05:00</lastReplicationRunTime>
<name>pstueck-passenger ondemand</name>
<replicationScheduleFrequencyType>ON_DEMAND</replicationScheduleFrequencyType>
<state>OK</state>
</return>
<return>
<CDPId>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx</CDPId>
<description/>
<diskSafeID>e8e13555-f577-40d2-99c8-fa8a019d3b55</diskSafeID>
<enabled>true</enabled>
<id>7f55f8d6-92a9-4b14-bff4-631559d92259</id>
<lastReplicationRunTime>2013-06-16T22:00:04.918-05:00</lastReplicationRunTime>
<name>pstueck-mysql daily</name>
<nextReplicationRunTime>2013-06-17T22:00:00-05:00</nextReplicationRunTime>
<replicationScheduleFrequencyType>DAILY</replicationScheduleFrequencyType>
<state>ALERT</state>
<warnings>Policy last completed with alerts</warnings>
</return>
</ns1:getPoliciesResponse>
</soap:Body>
</soap:Envelope>
But a large number of 'return' sections come back. I'm trying to use .search on the result, and I only want it to return the entire 'return' section for a given 'name'. Does anyone have any tips?
Current Code:
client = Savon::Client.new do
http.auth.basic "#{opts['api_username']}", "#{opts['api_password']}"
wsdl.document = "#{opts['api_url']}/Policy?wsdl"
end
getPolicyInformation = client.request :getPolicies
getPolicyInformation = Nokogiri::XML(getPolicyInformation.to_xml)
print getPolicyInformation
I'm wanting to return everything in the <return> section if I search for a specified <name>. Example: I only want to see the information relating to <name>pstueck-passenger ondemand</name>, but the entire <return> section that contains that.
You can use XPath to identify a node with a particular value and then specify that an ancestor element is of interest by doing something like the following:
require 'nokogiri'
document = <<-XML
<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns1:getPoliciesResponse xmlns:ns1="http://policy.api.control.r1soft.com/">
<return>
<CDPId>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx</CDPId>
<description/>
<diskSafeID>bcb68765-a719-4291-912d-2e6af485ea24</diskSafeID>
<enabled>true</enabled>
<id>cdb65427-d6f4-4a89-9f77-8763e22dc74b</id>
<lastReplicationRunTime>2013-06-12T13:29:40.105-05:00</lastReplicationRunTime>
<name>pstueck-passenger ondemand</name>
<replicationScheduleFrequencyType>ON_DEMAND</replicationScheduleFrequencyType>
<state>OK</state>
</return>
<return>
<CDPId>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx</CDPId>
<description/>
<diskSafeID>e8e13555-f577-40d2-99c8-fa8a019d3b55</diskSafeID>
<enabled>true</enabled>
<id>7f55f8d6-92a9-4b14-bff4-631559d92259</id>
<lastReplicationRunTime>2013-06-16T22:00:04.918-05:00</lastReplicationRunTime>
<name>pstueck-mysql daily</name>
<nextReplicationRunTime>2013-06-17T22:00:00-05:00</nextReplicationRunTime>
<replicationScheduleFrequencyType>DAILY</replicationScheduleFrequencyType>
<state>ALERT</state>
<warnings>Policy last completed with alerts</warnings>
</return>
</ns1:getPoliciesResponse>
</soap:Body>
</soap:Envelope>
XML
doc = Nokogiri::XML(document)
ns = { 'soap' => 'http://schemas.xmlsoap.org/soap/envelope/', 'ns1' => "http://policy.api.control.r1soft.com/" }
ret = doc.xpath('/soap:Envelope/soap:Body/ns1:getPoliciesResponse/return/name[text()="pstueck-passenger ondemand"]/ancestor::return', ns)
puts ret.count
puts ret.at('replicationScheduleFrequencyType').text
EDIT
Updated to reflect updated XML body in question. Now handles namespaces.
Using CSS to find the node:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<ns1:getPoliciesResponse xmlns:ns1="http://policy.api.control.r1soft.com/">
<return>
<CDPId>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx</CDPId>
<description/>
<diskSafeID>e8e13555-f577-40d2-99c8-fa8a019d3b55</diskSafeID>
<enabled>true</enabled>
<id>7f55f8d6-92a9-4b14-bff4-631559d92259</id>
<lastReplicationRunTime>2013-06-16T22:00:04.918-05:00</lastReplicationRunTime>
<name>pstueck-mysql daily</name>
<nextReplicationRunTime>2013-06-17T22:00:00-05:00</nextReplicationRunTime>
<replicationScheduleFrequencyType>DAILY</replicationScheduleFrequencyType>
<state>ALERT</state>
<warnings>Policy last completed with alerts</warnings>
</return>
<return>
<CDPId>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx</CDPId>
<description/>
<diskSafeID>bcb68765-a719-4291-912d-2e6af485ea24</diskSafeID>
<enabled>true</enabled>
<id>cdb65427-d6f4-4a89-9f77-8763e22dc74b</id>
<lastReplicationRunTime>2013-06-12T13:29:40.105-05:00</lastReplicationRunTime>
<name>pstueck-passenger ondemand</name>
<replicationScheduleFrequencyType>ON_DEMAND</replicationScheduleFrequencyType>
<state>OK</state>
</return>
</ns1:getPoliciesResponse>
</soap:Body>
</soap:Envelope>
EOT
return_tag = doc.at('return name[text()="pstueck-passenger ondemand"]').parent
puts return_tag.to_xml
Which outputs:
<return>
<CDPId>XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx</CDPId>
<description/>
<diskSafeID>bcb68765-a719-4291-912d-2e6af485ea24</diskSafeID>
<enabled>true</enabled>
<id>cdb65427-d6f4-4a89-9f77-8763e22dc74b</id>
<lastReplicationRunTime>2013-06-12T13:29:40.105-05:00</lastReplicationRunTime>
<name>pstueck-passenger ondemand</name>
<replicationScheduleFrequencyType>ON_DEMAND</replicationScheduleFrequencyType>
<state>OK</state>
</return>
Nokogiri supports both XPath and CSS. I find CSS easier to read.
I used the at method to find the first matching occurrence, and to show that it was the first matching, I swapped the order of the two <return> blocks. at is the same as search(...).first so when you're looking for the first instance of something in a document at is the way to go.
Nokogiri is usually smart enough to know the difference between XPath and CSS selectors, so we can use the generic at and search. If you need to force CSS or XPath parsing because the selector is ambiguous, you can use the specific css or xpath, or at_css or at_xpath, respectively. They're all documented in the Nokogiri::XML::Node docs.
parent is necessary because we want the parent of the selected node, which was <name>. I just slammed it into reverse and backed up a block. That is easier to do in XPath, where we can use .. to point to the parent node.

Resources