regular expressions to strip off all the XML - ruby

I am very new to Ruby and I need to write Ruby regular expressions that strip off all the XML and create a file with labels instead of XML tags.
For example, the first book should be:
book: bk101
author: Matthew Gambardella (notice first name first!!!)
title: XML Developer's Guide
Genre: Computer
Price: 44.95
Publish Date: October 1,2000 (Notice this is different from the XML - you must convert the date to this form)
Description: An in-depth look at creating applications
with XML
Here is my XML file -
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
</catalog>
Any help is really appreciated.

This looks more like a homework assignment than a question. I'll let you figure out how to write files and format the date --- here's something simple that will make a hash out of your XML and loop through each book / field one at a time (I shortened your document to two books).
require 'active_support/core_ext/hash'
xml_books = <<-EOF
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
</catalog>
EOF
books = Hash.from_xml(xml_books)
books['catalog']['book'].each do |e|
  e.keys.each do |k|
    printf("%s -> %s\n", k, e[k])
  end # do k
end # do e
Produces the following output:
id -> bk101
author -> Gambardella, Matthew
title -> XML Developer's Guide
genre -> Computer
price -> 44.95
publish_date -> 2000-10-01
description -> An in-depth look at creating applications
with XML.
id -> bk102
author -> Ralls, Kim
title -> Midnight Rain
genre -> Fantasy
price -> 5.95
publish_date -> 2000-12-16
description -> A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.
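For the regular-expression route the question asks about, here is a minimal sketch using only the standard library. It assumes the flat, well-formed layout shown in the question (one value per tag, no nested tags inside a book); regular expressions are fragile against XML in general. The names format_book, books_to_text and SAMPLE_CATALOG are hypothetical.

```ruby
require 'date'

# Hypothetical helper: formats one <book> element as labelled lines.
def format_book(id, body)
  # Collect every simple <tag>value</tag> pair; the m flag lets values span lines.
  fields = body.scan(%r{<(\w+)>(.*?)</\1>}m).to_h
  # "Gambardella, Matthew" becomes last = "Gambardella", first = "Matthew".
  last, first = fields['author'].split(/,\s*/, 2)
  date = Date.strptime(fields['publish_date'], '%Y-%m-%d')
  [
    "book: #{id}",
    "author: #{first} #{last}",
    "title: #{fields['title']}",
    "Genre: #{fields['genre']}",
    "Price: #{fields['price']}",
    "Publish Date: #{date.strftime('%B %-d,%Y')}",
    "Description: #{fields['description']}"
  ].join("\n")
end

# Hypothetical helper: converts the whole catalog, one blank line between books.
def books_to_text(xml)
  xml.scan(%r{<book id="([^"]+)">(.*?)</book>}m)
     .map { |id, body| format_book(id, body) }
     .join("\n\n")
end

SAMPLE_CATALOG = <<~XML
  <?xml version="1.0"?>
  <catalog>
  <book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
  <description>An in-depth look at creating applications
  with XML.</description>
  </book>
  </catalog>
XML

puts books_to_text(SAMPLE_CATALOG)
```

To write the result to a file instead of printing it, File.write with a file name of your choice and the string returned by books_to_text should be enough.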

Related

How to combine two XML files with Nokogiri

I am trying to combine two separate, but related, files with Nokogiri. I want to combine the "product" and "product pricing" if "ItemNumber" is the same.
I loaded the documents, but I have no idea how to combine the two.
Product File:
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ProductTypeId>0</ProductTypeId>
<Description>This single bit curved grip axe handle is made for 3 to 5 pound axes. A good quality replacement handle made of American hickory with a natural wax finish. Hardwood handles do not conduct electricity and American Hickory is known for its strength, elasticity and ability to absorb shock. These handles provide exceptional value and economy for homeowners and other occasional use applications. Each Link handle comes with the required wedges, rivets, or epoxy needed for proper application of the tool head.</Description>
<ActiveFlag>Y</ActiveFlag>
<ImageFile>100024.jpg</ImageFile>
<ItemNumber>100024</ItemNumber>
<ProductVariants>
<ProductVariant>
<Sku>100024</Sku>
<ColorName></ColorName>
<SizeName></SizeName>
<SequenceNo>0</SequenceNo>
<BackOrderableFlag>N</BackOrderableFlag>
<InventoryLevel>0</InventoryLevel>
<ColorCode></ColorCode>
<SizeCode></SizeCode>
<TaxableFlag>Y</TaxableFlag>
<VariantPromoGroupCode></VariantPromoGroupCode>
<PricingGroupCode></PricingGroupCode>
<StartDate xsi:nil="true"></StartDate>
<EndDate xsi:nil="true"></EndDate>
<ActiveFlag>Y</ActiveFlag>
</ProductVariant>
</ProductVariants>
</Product>
</Products>
Product Pricing Fields:
<ProductPricing>
<ItemNumber>100024</ItemNumber>
<AcquisitionCost>8.52</AcquisitionCost>
<MemberCost>10.7</MemberCost>
<Price>14.99</Price>
<SalePrice xsi:nil="true"></SalePrice>
<SaleCode>0</SaleCode>
</ProductPricing>
I am looking to generate a file like this:
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ProductTypeId>0</ProductTypeId>
<Description>This single bit curved grip axe handle is made for 3 to 5 pound axes. A good quality replacement handle made of American hickory with a natural wax finish. Hardwood handles do not conduct electricity and American Hickory is known for its strength, elasticity and ability to absorb shock. These handles provide exceptional value and economy for homeowners and other occasional use applications. Each Link handle comes with the required wedges, rivets, or epoxy needed for proper application of the tool head.</Description>
<ActiveFlag>Y</ActiveFlag>
<ImageFile>100024.jpg</ImageFile>
<ItemNumber>100024</ItemNumber>
<ProductVariants>
<ProductVariant>
<Sku>100024</Sku>
<ColorName></ColorName>
<SizeName></SizeName>
<SequenceNo>0</SequenceNo>
<BackOrderableFlag>N</BackOrderableFlag>
<InventoryLevel>0</InventoryLevel>
<ColorCode></ColorCode>
<SizeCode></SizeCode>
<TaxableFlag>Y</TaxableFlag>
<VariantPromoGroupCode></VariantPromoGroupCode>
<PricingGroupCode></PricingGroupCode>
<StartDate xsi:nil="true"></StartDate>
<EndDate xsi:nil="true"></EndDate>
<ActiveFlag>Y</ActiveFlag>
</ProductVariant>
</ProductVariants>
</Product>
<ProductPricing>
<ItemNumber>100024</ItemNumber>
<AcquisitionCost>8.52</AcquisitionCost>
<MemberCost>10.7</MemberCost>
<Price>14.99</Price>
<SalePrice xsi:nil="true"></SalePrice>
<SaleCode>0</SaleCode>
</ProductPricing>
</Products>
Here is the code I have so far:
require 'csv'
require 'nokogiri'
xml = File.read('lateApril-product-pricing.xml')
xml2 = File.read('lateApril-master-date')
doc = Nokogiri::XML(xml)
doc2 = Nokogiri::XML(xml2)
pricing_data = []
item_number = []
doc.xpath('//ProductsPricing/ProductPricing').each do |file|
itemNumber = file.xpath('./ItemNumber').first.text
variant_Price = file.xpath('./Price').first.text
pricing_data << [ itemNumber, variant_Price ]
item_number << [ itemNumber ]
end
puts item_number ## This prints all the item number but i have no idea how to loop through them and combine them with Product XML
doc2.xpath('//Products/Product').each do |file|
itemNumber = file.xpath('./ItemNumber').first.text #not sure how to write the conditions here since i don't have pricing fields available in this method
end
Try this on:
require 'nokogiri'
doc1 = Nokogiri::XML(<<EOT)
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
</Product>
</Products>
EOT
doc2 = Nokogiri::XML(<<EOT)
<ProductPricing>
<ItemNumber>100024</ItemNumber>
</ProductPricing>
EOT
doc1.at('Product').add_next_sibling(doc2.at('ProductPricing'))
Which results in:
puts doc1.to_xml
# >> <?xml version="1.0"?>
# >> <Products>
# >> <Product>
# >> <Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
# >> </Product><ProductPricing>
# >> <ItemNumber>100024</ItemNumber>
# >> </ProductPricing>
# >> </Products>
Please, when you ask, strip the example input and expected resulting output to the absolute, bare, minimum. Anything beyond that wastes space, eye-time and brain CPU.
This is untested code, but is where I'd start if I was going to merge two files containing multiple <ItemNumber> nodes:
require 'nokogiri'
doc1 = Nokogiri::XML(<<EOT)
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ItemNumber>100024</ItemNumber>
</Product>
</Products>
EOT
doc2 = Nokogiri::XML(<<EOT)
<ProductPricing>
<ItemNumber>100024</ItemNumber>
</ProductPricing>
EOT
# build a hash mapping each item number in doc1 to its Product node
doc1_products_by_item_numbers = doc1.search('Product').map { |product|
  [product.at('ItemNumber').text, product]
}.to_h
# build a hash mapping each item number in doc2 to its ProductPricing node
doc2_products_by_item_numbers = doc2.search('ProductPricing').map { |pricing|
  [pricing.at('ItemNumber').text, pricing]
}.to_h
# append doc2 entries to doc1 after each product based on item numbers,
# skipping products that have no matching pricing entry
doc1_products_by_item_numbers.each { |item_number, product|
  pricing = doc2_products_by_item_numbers[item_number]
  product.add_next_sibling(pricing) if pricing
}

How to add the values for respective elements?

I have 3 XML structures as below:
a.xml
<Books>
<Book>
<Publisher>ABC Pvt Ltd</Publisher>
<Month>May</Month>
<Year>2016</Year>
<BooksReleased>4</BooksReleased>
</Book>
</Books>
b.xml
<Books>
<Book>
<Publisher>XYZ Pvt Ltd</Publisher>
<Month>April</Month>
<Year>2016</Year>
<BooksReleased>2</BooksReleased>
</Book>
</Books>
c.xml
<Books>
<Book>
<Publisher>ABC Pvt Ltd</Publisher>
<Month>June</Month>
<Year>2016</Year>
<BooksReleased>2</BooksReleased>
</Book>
</Books>
I would like to group these XML by publisher and also need to calculate its total no. of BooksReleased by the publisher for particular year.
required output format:
<TotalCalc>
<PublishedBook>
<Publisher>ABC Pvt Ltd</Publisher>
<no.of books>6</no.of books>
</PublishedBook>
<PublishedBook>
<Publisher>XYZ Pvt Ltd</Publisher>
<no.of books>2</no.of books>
</PublishedBook>
</TotalCalc>
Kindly help me. I tried the following, but it's not working:
typeswitch($Publisher)
case element (ABC Pvt Ltd)
return sum($doc/BooksReleases[$doc/$Publisher = 'ABC Pvt Ltd'])
default return 'unknnown'
It might be possible to use cts:value-tuples to pull up co-occurrences of Publisher and 'BooksReleased', which you can then iterate to aggregate by Publisher. That would scale much better. Something like:
let $aggregates := map:map()
let $_ :=
for $tuple in cts:value-tuples((
cts:element-reference(xs:QName("Publisher")),
cts:element-reference(xs:QName("BooksReleased"))
))
let $values := json:array-values($tuple)
let $pub := $values[1]
let $books as xs:int := $values[2]
return map:put($aggregates, $pub, (map:get($aggregates, $pub), 0)[1] + $books)
return $aggregates
Note, though, that this requires indexes on Publisher and BooksReleased, and it is important that each document contains only one (value of) Publisher to prevent cross-products.
I would also consider simply dropping (or ignoring) BooksReleased, and just making sure you save each book as a separate document. You can then use cts:values on Publisher and use cts:frequency on each publisher value to get the number of books for the publishers.
HTH!
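For anyone who wants to check the expected totals outside MarkLogic, the same aggregation can be sketched in plain Ruby with REXML. This is only an illustration of the grouping logic, not a replacement for the indexed query above; BOOK_DOCS and books_per_publisher are hypothetical names, and the three strings stand in for a.xml, b.xml and c.xml.

```ruby
require 'rexml/document'

# Stand-ins for a.xml, b.xml and c.xml from the question.
BOOK_DOCS = [
  '<Books><Book><Publisher>ABC Pvt Ltd</Publisher><Month>May</Month>' \
  '<Year>2016</Year><BooksReleased>4</BooksReleased></Book></Books>',
  '<Books><Book><Publisher>XYZ Pvt Ltd</Publisher><Month>April</Month>' \
  '<Year>2016</Year><BooksReleased>2</BooksReleased></Book></Books>',
  '<Books><Book><Publisher>ABC Pvt Ltd</Publisher><Month>June</Month>' \
  '<Year>2016</Year><BooksReleased>2</BooksReleased></Book></Books>'
].freeze

# Hypothetical helper: sums BooksReleased per publisher for one year.
def books_per_publisher(xml_docs, year)
  totals = Hash.new(0) # publishers not seen yet start at zero
  xml_docs.each do |xml|
    doc = REXML::Document.new(xml)
    REXML::XPath.each(doc, '//Book') do |book|
      next unless book.elements['Year'].text == year.to_s
      publisher = book.elements['Publisher'].text
      totals[publisher] += book.elements['BooksReleased'].text.to_i
    end
  end
  totals
end

p books_per_publisher(BOOK_DOCS, 2016)
```

With the three sample documents this gives 6 for ABC Pvt Ltd and 2 for XYZ Pvt Ltd, which matches the output the question asks for.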

Parsing an XML feed using Nokogiri isn't working

This is my code:
doc= Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search=doc.css('item')
if !search.blank?
search.each do |data|
title=data.css("title").text
link=data.css("link").text
end
end
but I did not get the link.
Several things are wrong:
if !search.blank?
won't work in plain Ruby because search is a NodeSet returned by doc.css, and NodeSets don't have a blank? method (it only becomes available when ActiveSupport is loaded, as in Rails). Perhaps you meant empty?
title=data.css("title").text
isn't the correct way to find the title because, like in the above problem, you're getting a NodeSet instead of a Node. Getting text from a NodeSet can return a lot of garbage you don't want. Instead do:
title=data.at("title").text
Changing the code to this:
require 'nokogiri'
require 'open-uri'
# URI.open is needed on Ruby 3.0 and later, where a bare open no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search = doc.css('item')
if !search.empty?
  search.each do |data|
    title = data.at("title").text
    link = data.at("link").text
    puts "title: #{ title } link: #{ link }"
  end
end
Outputs:
title: Ex-Bengals cheerleaders lawsuit trial to begin link:
title: Freedom Center Offering Free Admission Monday link:
title: Miami University Band Performing in the Inaugural Parade link:
title: Northern Kentucky Man To Present Colors At Inauguration link:
title: John Gumms Monday Forecast link:
title: President Obama VP Biden sworn in officially begin second terms link:
title: Colerain Township Pizza Hut Robbed Saturday Night link:
title: Cold Snap Coming to Tri-State link:
title: 2 Men Arrested After Police Chase in Northern Kentucky link:
The link won't work because the XML is malformed, which, in my experience, is unbelievably common on the internet because people don't take the time to check their work.
The fix is going to take massaging the XML prior to Nokogiri receiving the content, or to modify your accessors. Luckily, this particular XML is easy to work around so this should help:
require 'nokogiri'
require 'open-uri'
# URI.open is needed on Ruby 3.0 and later, where a bare open no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search = doc.css('item')
if !search.empty?
  search.each do |data|
    title = data.at("title").text
    link = data.at("link").next_sibling.text
    puts "title: #{ title } link: #{ link }"
  end
end
Which outputs:
title: Ex-Bengals cheerleaders lawsuit trial to begin link: http://www.cincinnatisun.com/index.php/sid/212072454/scat/90d24f4ad98a2793
title: Freedom Center Offering Free Admission Monday link: http://www.cincinnatisun.com/index.php/sid/212072914/scat/90d24f4ad98a2793
title: Miami University Band Performing in the Inaugural Parade link: http://www.cincinnatisun.com/index.php/sid/212072915/scat/90d24f4ad98a2793
title: Northern Kentucky Man To Present Colors At Inauguration link: http://www.cincinnatisun.com/index.php/sid/212072913/scat/90d24f4ad98a2793
title: John Gumms Monday Forecast link: http://www.cincinnatisun.com/index.php/sid/212070535/scat/90d24f4ad98a2793
title: President Obama VP Biden sworn in officially begin second terms link: http://www.cincinnatisun.com/index.php/sid/212060033/scat/90d24f4ad98a2793
title: Colerain Township Pizza Hut Robbed Saturday Night link: http://www.cincinnatisun.com/index.php/sid/212057132/scat/90d24f4ad98a2793
title: Cold Snap Coming to Tri-State link: http://www.cincinnatisun.com/index.php/sid/212057131/scat/90d24f4ad98a2793
title: 2 Men Arrested After Police Chase in Northern Kentucky link: http://www.cincinnatisun.com/index.php/sid/212057130/scat/90d24f4ad98a2793
All that done, you can write your code more clearly like:
require 'nokogiri'
require 'open-uri'
# URI.open is needed on Ruby 3.0 and later, where a bare open no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
doc.css('item').each do |data|
  title = data.at("title").text
  link = data.at("link").next_sibling.text
  puts "title: #{ title } link: #{ link }"
end
Interestingly enough, now the sample page appears to have its links fixed.
According to http://nokogiri.org/tutorials/searching_a_xml_html_document.html something like:
doc = Nokogiri::XML(File.read("feed.xml"))
doc.xpath('//xmlns:link')
should do the job. But be aware that the XML snippet you provided isn't a valid XML feed at all (no root element, item tag closed but never opened, etc.). The code assumes the XML feed looks like, for example:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<item>
<title>Atom-Powered Robots Run Amok</title>
<link>http://example.org/2003/12/13/atom03</link>
</item>
</feed>
And extracts:
<link>http://example.org/2003/12/13/atom03</link>
as the result. Please try to look at the documentation/reference first if you have problems like this. If you tried something and it didn't work as you expected, then you can consult Stack Overflow with actual code; that makes it easier to understand your problem and provide help.

Distinct Result via xQuery

I'm trying to get reviewers who review one or more books published after 2010.
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return {$r/Reviewer}
The following are both XML files.
review.xml:
<Reviews>
<Review>
<ReviewID>R1</ReviewID>
<BookTitle>B1</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
<Review>
<ReviewID>R2</ReviewID>
<BookTitle>B1</BookTitle>
<Reviewer>BBB</Reviewer>
</Review>
<Review>
<ReviewID>R3</ReviewID>
<BookTitle>B2</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
<Review>
<ReviewID>R4</ReviewID>
<BookTitle>B3</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
</Reviews>
book.xml:
<Books>
<Book>
<Title>B1</Title>
<Year>2005</Year>
</Book>
<Book>
<Title>B2</Title>
<Year>2011</Year>
</Book>
<Book>
<Title>B3</Title>
<Year>2012</Year>
</Book>
</Books>
I get two AAA results from my XQuery code. I was wondering if I can get a distinct result, meaning only one AAA. I've tried distinct-values() but don't know how to use it properly. Thanks for your reply!
----My Updated Solution with XML format for xQuery 1.0----
<root>
{
for $x in distinct-values
(
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
)
return <reviewer>{$x}</reviewer>
}
</root>
To preserve nodes, you can use the "group by" clause and select the first item of a group sequence:
for $r in doc("review.xml")//Review,
$b in doc("book.xml")//Book
let $n := $r/Reviewer
where $b/Title = $r/BookTitle
and $b/Year > 2010
group by $n
return $r[1]/Reviewer
The following query will give you all distinct reviewer names (note that the values are atomized, which means the element nodes are removed):
distinct-values(
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
)

Trying to parse a XML using Nokogiri with Ruby

I am new to programming so bear with me. I have an XML document that looks like this:
File name: PRIDE1542.xml
<ExperimentCollection version="2.1">
<Experiment>
<ExperimentAccession>1015</ExperimentAccession>
<Title>**Protein complexes in Saccharomyces cerevisiae (GPM06600002310)**</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<Protocol>
<ProtocolName>**None**</ProtocolName>
</Protocol>
<mzData version="1.05" accessionNumber="1015">
<cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
<cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
<description>
<admin>
<sampleName>**GPM06600002310**</sampleName>
<sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
<cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
</sampleDescription>
</admin>
</description>
<spectrumList count="0" />
</mzData>
</Experiment>
</ExperimentCollection>
I want to extract the text between the <Title>, <ProtocolName>, and <sampleName> tags and put it into a text file (I tried bolding them to make them easier to see). I have the following code so far (based on posts I saw on this site), but it doesn't seem to work:
>> require 'rubygems'
>> require 'nokogiri'
>> doc = Nokogiri::XML(File.open("PRIDE_Exp_Complete_Ac_10094.xml"))
>> #ExperimentCollection = doc.css("ExperimentCollection Title").map {|node| node.children.text }
Can someone help me?
Try to access them using xpath expressions. You can enter the path through the parse tree using slashes.
puts doc.xpath( "/ExperimentCollection/Experiment/Title" ).text
puts doc.xpath( "/ExperimentCollection/Experiment/Protocol/ProtocolName" ).text
puts doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleName" ).text
