How to extract xml attributes using Xpath in Pig? - xpath

I wanted to extract the attributes form an xml using Pig Latin.
This is a sample of the xml file
<CATALOG>
<BOOK>
<TITLE test="test1">Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
</CATALOG>
I used this script but it didn't work:
REGISTER ./piggybank.jar
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A = LOAD './books.xml' using org.apache.pig.piggybank.storage.XMLLoader('BOOK') as (x:chararray);
B = FOREACH A GENERATE XPath(x, 'BOOK/TITLE/#test'), XPath(x, 'BOOK/PRICE');
dump B;
The output was:
(,24.90)
I hope someone can help me with this.
Thanks.

There are 2 bugs in piggybank's XPath class:
The ignoreNamespace logic breaks searching for XML attributes
https://issues.apache.org/jira/browse/PIG-4751
The ignoreNamepace parameter is defaulted to true and cannot be overwritten
https://issues.apache.org/jira/browse/PIG-4752
Here is my workaround using XPathAll:
XPathAll(x, 'BOOK/TITLE/#test', true, false).$0 as (test:chararray)
Also if you still need to ignore namespaces:
XPathAll(x, '//*[local-name()=\'BOOK\']//*[local-name()=\'TITLE\']/#test', true, false).$0 as (test:chararray)

Related

Read value from XML within another XML: Mule

I am making a SOAP webservice call and I get the below response. I want to read the value in internal XML, the value is 12345684 in 1234684 in the below XML.
I was able to get internal XML using #[xpath3('//:processaResponse /return[2]')], store it in a flow variable and #[xpath3('/AckReg/DataArea/PRegistration/PRDet/Person/IDSet/:ID[#schemeName="aid"]/text()')].
This works when I try an online parser, but it doesn't read the value in Mule.
Is there any way to extract 1234684 in oa:ID tag using one XPath.
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Header>
<ns3:TXID xmlns:ns3="http://a.d.r.test.com/"></ns3:TXID>
<ns3:SESSIONID xmlns:ns3="http://a.d.r.test.com/"></ns3:SESSIONID>
</soapenv:Header>
<soapenv:Body>
<ns3:processaResponse xmlns:ns3="http://a.d.r.test.com/" xmlns:ns2="http://p.r.test.com/">
<return>Hi</return>
<return>
<?xml version="1.0" encoding="UTF-8"?>
<AckReg
xmlns="http://www.test.com/e/1" languageCode="en-US" releaseID="normalizedString" systemEnvironmentCode="test" versionID="normalizedString"
xmlns:oa="www.test.com/r/9"
xsi:schemaLocation="http://www.test.com/a/1 ../test/test.xsd">
<Apa>
<oa:CreationDateTime>2018-04-05</oa:CreationDateTime>
</Apa>
<DataArea>
<Ack>
<OArea>
<o:Sender>
<o:LID schemeAgencyName="testi" schemeName="Application ID">test</o:LID>
</o:Sender>
</OArea>
<OriginalActionVerb/>
</Ack>
<PRegistration>
<testids>
<IDSet schemeAgencyName="try">
<oa:ID schemeName="abcid">1234684</oa:ID>
</IDSet>
</testids>
<PRDet>
<Person>
<IDSet schemeAgencyName="try">
<oa:ID schemeName="aid">1364561</oa:ID>
</IDSet>
<IDSet schemeAgencyName="enada">
<oa:ID schemeName="Employee ID">adsad</oa:ID>
</IDSet>
</Person>
<User>
<oa:ID/>
</User>
</PRDet>
</PRegistration>
</DataArea>
</AckReg>
</return>
</ns3:processaResponse>
</soapenv:Body>
</soapenv:Envelope>
In your expressions you were missing namespace prefixes or namespace wildcards *: on some nodes - so your expressions failed.
Is there any way to extract 1234684 in oa:ID tag using one XPath.
Combining both of your partial expressions is possible with namespace wildcards:
//*:processaResponse/return[2]/*:AckReg/*:DataArea/*:PRegistration/*:testids/*:IDSet/*:ID[#schemeName='abcid']/text()
Or you can use an absolute path with namespace wildcards:
/*:Envelope/*:Body/*:processaResponse/return[2]/*:AckReg/*:DataArea/*:PRegistration/*:testids/*:IDSet/*:ID[#schemeName='abcid']/text()
Output in both cases:
1234684
You can even use XmlSlurper class using groovy script to fetch that respective value.
root = new XmlSlurper( false, true).parseText(payload).declareNamespace('soapenv':"http://schemas.xmlsoap.org/soap/envelope/")

XPath for stackoverflow dump files

Am working with file with following format:
<badges>
<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>
<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>
</badges>
I am using pig xmlloader to fetch this xml data into hdfs.
A = LOAD '/badges' using org.apache.pig.piggybank.storage.XMLLoader('row') as (x:chararray);
B = foreach A generate xpath(x, '/row#Id').
Dump B;
Output I get () - No values.
I want the file output as text i.e 1,1,Teacher,2009-09-30T15:17:50.66. How can I do this?
I'm not familiar with pig xmlloader, but /row#Id has two problems:
It's not valid XPath
If it were, it would be an absolute path
Try:
B = foreach A generate xpath(x, 'row/#Id').
It uses valid syntax and a relative path.
Use XPathAll for extracting attributes.Xpath has an issue when it comes to attributes.
REGISTER '/path/piggybank-0.15.0.jar'; -- Use the jar name you downloaded
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
B = foreach A generate XPathAll(x, 'row/#Id', true, false).$0 as (id:chararray);

How to use Nokogiri to combine multiple like-formatted XML files into CSV

I want to parse multiple like-formatted XML files into a CSV file.
I searched on Google, nokogiri.org, and on SO but I haven't been able to find an answer.
I have ten XML files in identical format in terms of node/element structure, that reside in the current directory.
After combining the XML files into a single XML file, I need to pull out specific elements of the advisory node. I would like to output the link, title, location, os -> language -> name, and reference -> name data to the CSV file.
My code is only able to parse a single XML document and I'd like it to take into account 1:many:
# Parse the XML file into a Nokogiri::XML::Document object
#doc = Nokogiri::XML(File.open("file.xml"))
# Gather the 5 specific XML elements out of the 'advisory' top-level node
data = #doc.search('advisory').map { |adv|
[
adv.at('link').content,
adv.at('title').content,
adv.at('location').content,
adv.at('os > language > name').content,
adv.at('reference > name').content
]
}
# Loop through each array element in the object and write out as CSV row
CSV.open('output_file.csv', 'wb') do |csv|
# Explicitly set headers until you figure out how to get them programatically
csv << ['Link', 'Title', 'Location', 'OS Name', 'Reference Name']
data.each do |row|
csv << row
end
end
I tried changing the code to support multiple XML files and get them into Nokogiri::XML::Document objects:
xml_docs = []
Dir.glob("*.xml").each do |file|
xml = Nokogiri::XML(File.new(file))
xml_docs << Nokogiri::XML::Document.new(xml)
end
This successfully creates an array xml_docs with the correct objects it in, but I don't know how to convert these six objects into a single object.
This is sample XML. All XML files use the same node/element structure:
<advisories>
<title> Not relevant </title>
<customer> N/A </customer>
<advisory id="12345">
<link> https://www.google.com </link>
<release_date>2016-04-07</release_date>
<title> The Short Description Would Go Here </title>
<location> Location Name Here </location>
<os>
<product>
<id>98765</id>
<name>Product Name</name>
</product>
<language>
<id>123</id>
<name>en</name>
</language>
</os>
<reference>
<id>00029</id>
<name>Full</name>
<area>Not Defined</area>
</reference>
</advisory>
<advisory id="98765">
<link> https://www.msn.com </link>
<release_date>2016-04-08</release_date>
<title> The Short Description Would Go Here </title>
<location> Location Name Here </location>
<os>
<product>
<id>12654</id>
<name>Product Name</name>
</product>
<language>
<id>126</id>
<name>fr</name>
</language>
</os>
<reference>
<id>00052</id>
<name>Partial</name>
<area>Defined</area>
</reference>
</advisory>
</advisories>
The code leverages Nokogiri::XML::Document but if Nokogiri::XML::Builder will work better for this, I am more than willing to adjust my code accordingly.
I'd handle the first part, of parsing one XML file, like this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<advisories>
<advisory id="12345">
<link> https://www.google.com </link>
<title> The Short Description Would Go Here </title>
<location> Location Name Here </location>
<os>
<language>
<name>en</name>
</language>
</os>
<reference>
<name>Full</name>
</reference>
</advisory>
<advisory id="98765">
<link> https://www.msn.com </link>
<release_date>2016-04-08</release_date>
<title> The Short Description Would Go Here </title>
<location> Location Name Here </location>
<os>
<language>
<name>fr</name>
</language>
</os>
<reference>
<name>Partial</name>
</reference>
</advisory>
</advisories>
EOT
Note: This has nodes removed because they weren't important to the question. Please remove fluff when asking as it's distracting.
With this being the core of the code:
doc.search('advisory').map{ |advisory|
link = advisory.at('link').text
title = advisory.at('title').text
location = advisory.at('location').text
os_language_name = advisory.at('os > language > name').text
reference_name = advisory.at('reference > name').text
{
link: link,
title: title,
location: location,
os_language_name: os_language_name,
reference_name: reference_name
}
}
That could be DRY'd but was written as an example of what to do.
Running that results in an array of hashes, which would be easily output via CSV:
# => [
{:link=>" https://www.google.com ", :title=>" The Short Description Would Go Here ", :location=>" Location Name Here ", :os_language_name=>"en", :reference_name=>"Full"},
{:link=>" https://www.msn.com ", :title=>" The Short Description Would Go Here ", :location=>" Location Name Here ", :os_language_name=>"fr", :reference_name=>"Partial"}
]
Once you've got that working then fit it into a modified version of your loops to output CSV and read the XML files. This is untested but looks about right:
CSV.open('output_file.csv', 'w',
headers: ['Link', 'Title', 'Location', 'OS Name', 'Reference Name'],
write_headers: true
) do |csv|
Dir.glob("*.xml").each do |file|
xml = Nokogiri::XML(File.read(file))
# parse a file and get the array of hashes
end
# pass the array of hashes to CSV for output
end
Note that you were using a file mode of 'wb'. You rarely need b with CSV as CSV is supposed to be a text format. If you are sure you will encounter binary data then use 'b' also, but that could lead down a path containing dragons.
Also note that this is using read. read is not scalable, which means it doesn't care how big a file is, it's going to try to read it into memory, whether or not it'll actually fit. There are lots of reasons to avoid that, but the best is it'll take your program to its knees. If your XML files could exceed the available free memory for your system then you'll want to rewrite using a SAX parser, which Nokogiri supports. How to do that is a different question.
it was actually an Array of array of hashes. I'm not sure how I ended up there but I was easily able to use array.flatten
Meditate on this:
foo = [] # => []
foo << [{}] # => [[{}]]
foo.flatten # => [{}]
You probably wanted to do this:
foo = [] # => []
foo += [{}] # => [{}]
Any time I have to use flatten I look to see if I can create the array without it being an array of arrays of something. It's not that they're inherently bad, because sometimes they're very useful, but you really wanted an array of hashes so you knew something was wrong and flatten was a cheap way out, but using it also costs more CPU time. It's better to figure out the problem and fix it and end up with faster/more efficient code. (And some will say that's a wasted effort or is premature optimization, but writing efficient code is a very good trait and goal.)

Parsing out contents of XML tag in Ruby

I have an XML, that as I understand it has already been parsed by tags. My goal is to parse all the information that is in the <GetResidentsContactInfoResult> tag. In this tag of the sample xml below there are two records in here which begin each with the Lease PropertyId key. How can I iterate over the <GetResidentsContactInfoResult> tag and print out the key/value pairs for each record? I'm new to Ruby and working with XML files, is this something I can do with Nokogiri?
<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soap:Body>
<GetResidentsContactInfoResponse xmlns="http://tempuri.org/">
<GetResidentsContactInfoResult><PropertyResidents><Lease PropertyId="21M" BldgID="00" UnitID="0903" ResiID="3" occustatuscode="P" occustatuscodedescription="Previous" MoveInDate="2016-01-07T00:00:00" MoveOutDate="2016-02-06T00:00:00" LeaseBeginDate="2016-01-07T00:00:00" LeaseEndDate="2017-01-31T00:00:00" MktgSource="DBY" PrimaryEmail="noemail1#fake.com"><Occupant PropertyId="21M" BldgID="00" UnitID="0903" ResiID="3" OccuSeqNo="3444755" OccuFirstName="Efren" OccuLastName="Cerda" Phone2No="(832) 693-9448" ResponsibleFlag="Responsible" /></Lease><Lease PropertyId="21M" BldgID="00" UnitID="0908" ResiID="2" occustatuscode="P" occustatuscodedescription="Previous" MoveInDate="2016-02-20T00:00:00" MoveOutDate="2016-04-25T00:00:00" LeaseBeginDate="2016-02-20T00:00:00" LeaseEndDate="2017-02-28T00:00:00" MktgSource="PW" PrimaryEmail="noemail1#fake.com"><Occupant PropertyId="21M" BldgID="00" UnitID="0908" ResiID="2" OccuSeqNo="3451301" OccuFirstName="Donna" OccuLastName="Mclean" Phone2No="(713) 785-4240" ResponsibleFlag="Responsible" /></Lease></PropertyResidents></GetResidentsContactInfoResult>
</GetResidentsContactInfoResponse>
</soap:Body>
</soap:Envelope>
This uses Nokogiri to find all the GetResidentsContactInfoResponse elements, and then Active Support to convert the inner text to a hash of key-value pairs.
Read "sparklemotion/nokogiri" and "Tutorials" regarding installing and using Nokogiri.
Read "Active Support Core Extensions" about more capabilities of Active Support (though the guide does not include Hash.from_xml). To install it simply do gem install activesupport.
I assume you're fine with Nokogiri as you mentioned it in your question.
If you don't want to use Active Support, consider looking into "Convert a Nokogiri document to a Ruby Hash" as an alternative to the line Hash.from_xml(elm.text):
# Needed in order to use the `Hash.from_xml`
require 'active_support/core_ext/hash/conversions'
def find_key_values(str)
doc = Nokogiri::XML(str)
# Ignore namespaces for easier traversal
doc.remove_namespaces!
doc.css('GetResidentsContactInfoResponse').map do |elm|
Hash.from_xml(elm.text)
end
end
Usage:
# Option 1: if your XML above is stored in a variable called `string`
find_key_values string
# Option 2: if your XML above is stored in a file
find_key_values File.open('/path/to/file')
Which returns:
[{"PropertyResidents"=>
{"Lease"=>
[{"PropertyId"=>"21M",
"BldgID"=>"00",
"UnitID"=>"0903",
"ResiID"=>"3",
"occustatuscode"=>"P",
"occustatuscodedescription"=>"Previous",
"MoveInDate"=>"2016-01-07T00:00:00",
"MoveOutDate"=>"2016-02-06T00:00:00",
"LeaseBeginDate"=>"2016-01-07T00:00:00",
"LeaseEndDate"=>"2017-01-31T00:00:00",
"MktgSource"=>"DBY",
"PrimaryEmail"=>"noemail1#fake.com",
"Occupant"=>
{"PropertyId"=>"21M",
"BldgID"=>"00",
"UnitID"=>"0903",
"ResiID"=>"3",
"OccuSeqNo"=>"3444755",
"OccuFirstName"=>"Efren",
"OccuLastName"=>"Cerda",
"Phone2No"=>"(832) 693-9448",
"ResponsibleFlag"=>"Responsible"}},
{"PropertyId"=>"21M",
"BldgID"=>"00",
"UnitID"=>"0908",
"ResiID"=>"2",
"occustatuscode"=>"P",
"occustatuscodedescription"=>"Previous",
"MoveInDate"=>"2016-02-20T00:00:00",
"MoveOutDate"=>"2016-04-25T00:00:00",
"LeaseBeginDate"=>"2016-02-20T00:00:00",
"LeaseEndDate"=>"2017-02-28T00:00:00",
"MktgSource"=>"PW",
"PrimaryEmail"=>"noemail1#fake.com",
"Occupant"=>
{"PropertyId"=>"21M",
"BldgID"=>"00",
"UnitID"=>"0908",
"ResiID"=>"2",
"OccuSeqNo"=>"3451301",
"OccuFirstName"=>"Donna",
"OccuLastName"=>"Mclean",
"Phone2No"=>"(713) 785-4240",
"ResponsibleFlag"=>"Responsible"}}]}}]

How to deal with namespaces in XML with Ruby and Nokogiri

I have this XML document:
<item>
<title>The Big Bang Theory - Temporada 1</title>
<lge:titleid>season_id4</lge:titleid>
<lge:thumb>https://d12h56qju7t1ah.cloudfront.net/system/artworks/4/h80/the-big-bang-theory-1.20130326114106.jpg?1364298066</lge:thumb>
</item>
and parse it with:
parsed = Nokogiri::XML.parse(File.read(file_doc))
parsed.class returns "Nokogiri::XML::Document".
But when I do:
parsed.at_xpath('lge:titleid').text
I get:
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //lge:titleid
What could be the reason for this and how can I get the text of that particular node?

Resources