Distinct Result via xQuery - distinct

I'm trying to get reviewers who review one or more books published after 2010.
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
The following are both XML files.
review.xml:
<Reviews>
<Review>
<ReviewID>R1</ReviewID>
<BookTitle>B1</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
<Review>
<ReviewID>R2</ReviewID>
<BookTitle>B1</BookTitle>
<Reviewer>BBB</Reviewer>
</Review>
<Review>
<ReviewID>R3</ReviewID>
<BookTitle>B2</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
<Review>
<ReviewID>R4</ReviewID>
<BookTitle>B3</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
</Reviews>
book.xml:
<Books>
<Book>
<Title>B1</Title>
<Year>2005</Year>
</Book>
<Book>
<Title>B2</Title>
<Year>2011</Year>
</Book>
<Book>
<Title>B3</Title>
<Year>2012</Year>
</Book>
</Books>
My XQuery code returns AAA twice. Can I get a distinct result instead, meaning only one AAA? I've tried distinct-values() but don't know how to use it properly. Thanks for your reply!
----My Updated Solution with XML format for xQuery 1.0----
<root>
{
for $x in distinct-values
(
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
)
return <reviewer>{$x}</reviewer>
}
</root>

To preserve nodes, you can use the "group by" clause (available from XQuery 3.0) and select the first item of each group's sequence:
for $r in doc("review.xml")//Review,
$b in doc("book.xml")//Book
let $n := $r/Reviewer
where $b/Title = $r/BookTitle
and $b/Year > 2010
group by $n
return $r[1]/Reviewer

The following query will give you all distinct reviewer names (note that the values are atomized, which means the element nodes are removed):
distinct-values(
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
)
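For readers who want to check the logic outside an XQuery processor, here is a minimal Python sketch of the same join-then-deduplicate idea. It assumes the two documents from the question are available as strings (with the closing </Reviews> tag corrected) and uses only the standard library. The function name reviewers_after is made up for illustration and is not part of any XQuery tooling.

```python
import xml.etree.ElementTree as ET

REVIEWS = """<Reviews>
  <Review><ReviewID>R1</ReviewID><BookTitle>B1</BookTitle><Reviewer>AAA</Reviewer></Review>
  <Review><ReviewID>R2</ReviewID><BookTitle>B1</BookTitle><Reviewer>BBB</Reviewer></Review>
  <Review><ReviewID>R3</ReviewID><BookTitle>B2</BookTitle><Reviewer>AAA</Reviewer></Review>
  <Review><ReviewID>R4</ReviewID><BookTitle>B3</BookTitle><Reviewer>AAA</Reviewer></Review>
</Reviews>"""

BOOKS = """<Books>
  <Book><Title>B1</Title><Year>2005</Year></Book>
  <Book><Title>B2</Title><Year>2011</Year></Book>
  <Book><Title>B3</Title><Year>2012</Year></Book>
</Books>"""


def reviewers_after(reviews_xml, books_xml, year):
    """Return distinct reviewer names for books published after `year`,
    in first-seen order (hypothetical helper, mirrors the query above)."""
    books = ET.fromstring(books_xml)
    # Titles of all books published after the given year.
    recent = {b.findtext("Title") for b in books.iter("Book")
              if int(b.findtext("Year")) > year}
    seen = []
    for review in ET.fromstring(reviews_xml).iter("Review"):
        name = review.findtext("Reviewer")
        # Keep a reviewer only once, like distinct-values().
        if review.findtext("BookTitle") in recent and name not in seen:
            seen.append(name)
    return seen


print(reviewers_after(REVIEWS, BOOKS, 2010))  # ['AAA']
```

Like distinct-values(), this returns plain strings rather than element nodes.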

Related

how to read the data from XML with spaces using oracle

I want to read the data from passage_para tag, after passage_para I have 2 spaces before the expression tag and after the expression tag I have one more space, etc. When I use extract function to get the passage_para tag from the XMLTYPE column it is eliminating all the spaces.
<?xml version="1.0" encoding="UTF-8"?> <item> <information number="sdjsadh" > <response_direction delivery_mode="xcs"> <dparagraph>test</dparagraph> </response_direction> </information> <i_content> <stimulus_reference> <passage> <prose style="1"> <passage_para> <expression> <math xmlns="Math" xmlns:xlink="xlink" display="inline" overflow="scroll"> <mr> <mi>z</mi> <mo>></mo> <mn>0</mn> </mr> </math> </expression> </passage_para> </prose> </passage> </stimulus_reference> </i_content> </item>
which I don't want because it is taking out the spaces. The desired output I need is " z > 0 ".
Note: Between the passage_para tag the child nodes may change, they are not going to be the same.

How to add the values for respective elements?

I have 3 XML structures as below:
a.xml
<Books>
<Book>
<Publisher>ABC Pvt Ltd</Publisher>
<Month>May</Month>
<Year>2016</Year>
<BooksReleased>4</BooksReleased>
</Book>
</Books>
b.xml
<Books>
<Book>
<Publisher>XYZ Pvt Ltd</Publisher>
<Month>April</Month>
<Year>2016</Year>
<BooksReleased>2</BooksReleased>
</Book>
</Books>
c.xml
<Books>
<Book>
<Publisher>ABC Pvt Ltd</Publisher>
<Month>June</Month>
<Year>2016</Year>
<BooksReleased>2</BooksReleased>
</Book>
</Books>
I would like to group these XML by publisher and also need to calculate its total no. of BooksReleased by the publisher for particular year.
required output format:
<TotalCalc>
<PublishedBook>
<Publisher>ABC Pvt Ltd</Publisher>
<no.of books>6</no.of books>
</PublishedBook>
<PublishedBook>
<Publisher>XYZ Pvt Ltd</Publisher>
<no.of books>2</no.of books>
</PublishedBook>
</TotalCalc>
Kindly help me. I tried the following, but it's not working:
typeswitch($Publisher)
case element (ABC Pvt Ltd)
return sum($doc/BooksReleases[$doc/$Publisher = 'ABC Pvt Ltd'])
default return 'unknnown'
It might be possible to use cts:value-tuples to pull up co-occurrences of Publisher and 'BooksReleased', which you can then iterate to aggregate by Publisher. That would scale much better. Something like:
let $aggregates := map:map()
let $_ :=
for $tuple in cts:value-tuples((
cts:element-reference(xs:QName("Publisher")),
cts:element-reference(xs:QName("BooksReleased"))
))
let $values := json:array-values($tuple)
let $pub := $values[1]
let $books as xs:int := $values[2]
return map:put($aggregates, $pub, (map:get($aggregates, $pub), 0)[1] + $books)
return $aggregates
Note, though, that this requires indexes on Publisher and BooksReleased, and it is important that each document contains only one Publisher value to prevent cross-products.
I would also consider simply dropping (or ignoring) BooksReleased, and just making sure you save each book as a separate document. You can then use cts:values on Publisher and use cts:frequency on each publisher value to get the number of books for the publishers.
HTH!
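The aggregation itself is independent of MarkLogic. As a rough sketch of the same running-total-per-publisher idea, the following Python uses only the standard library and assumes the three documents from the question are held in memory as strings. The helper name totals_by_publisher and the year filter are assumptions added for illustration; the snippet above does not filter by year.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOCS = [
    "<Books><Book><Publisher>ABC Pvt Ltd</Publisher><Month>May</Month>"
    "<Year>2016</Year><BooksReleased>4</BooksReleased></Book></Books>",
    "<Books><Book><Publisher>XYZ Pvt Ltd</Publisher><Month>April</Month>"
    "<Year>2016</Year><BooksReleased>2</BooksReleased></Book></Books>",
    "<Books><Book><Publisher>ABC Pvt Ltd</Publisher><Month>June</Month>"
    "<Year>2016</Year><BooksReleased>2</BooksReleased></Book></Books>",
]


def totals_by_publisher(docs, year):
    """Sum BooksReleased per Publisher for one year (hypothetical helper)."""
    totals = defaultdict(int)
    for xml_text in docs:
        for book in ET.fromstring(xml_text).iter("Book"):
            if book.findtext("Year") == year:
                # Same role as map:put(..., existing total + $books).
                totals[book.findtext("Publisher")] += int(book.findtext("BooksReleased"))
    return dict(totals)


print(totals_by_publisher(DOCS, "2016"))
# {'ABC Pvt Ltd': 6, 'XYZ Pvt Ltd': 2}
```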

I want to extract specific information from XML input using Hadoop Pig Latin

Expected output is: (Hadoop definitive guide,Tom white,24.90).
I have tried using the Regex_Extract() function. But, no luck yet. Can someone please help me out?
The input to my script is:
<CATALOG>
<BOOK>
<TITLE>Hadoop DEFINITIVE GUIDE</TITLE>
<AUTHOR>TOM WHITE</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>
You will have to extract <TITLE>, <AUTHOR> and <PRICE> separately and then join them together using the JOIN operator.
The following script does that:
-- Load input
A = LOAD '/input.txt' USING PigStorage() AS (f1:chararray);
-- Extract <TITLE>
B1 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<TITLE>(.*)</TITLE>', 1) AS (title:chararray);
C1 = FILTER B1 BY title is not null;
D1 = RANK C1;
-- Extract <AUTHOR>
B2 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<AUTHOR>(.*)</AUTHOR>', 1) AS (author:chararray);
C2 = FILTER B2 BY author is not null;
D2 = RANK C2;
-- Extract <PRICE>
B3 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<PRICE>(.*)</PRICE>', 1) AS (price:chararray);
C3 = FILTER B3 BY price is not null;
D3 = RANK C3;
-- Join 3 data sets
D = JOIN D1 BY $0, D2 BY $0, D3 By $0;
-- Eliminate the ranks
E = FOREACH D GENERATE $1 AS (title:chararray), $3 AS (author:chararray), $5 AS (price:chararray);
dump E;
For the input mentioned in the question, I got the following output:
(Hadoop DEFINITIVE GUIDE,TOM WHITE,24.90)
(Programming Pig,Alan Gates,30.90)
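The script pairs values purely by position: the n-th title is joined with the n-th author and the n-th price. A small Python sketch of that same idea is shown here, assuming the catalog is available as a string with one tag per line. The helper name extract is hypothetical. As with the RANK/JOIN approach, the pairing only holds if every <BOOK> contains all three tags.

```python
import re

CATALOG = """<CATALOG>
<BOOK>
<TITLE>Hadoop DEFINITIVE GUIDE</TITLE>
<AUTHOR>TOM WHITE</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>"""


def extract(tag, lines):
    """Return the text of every <tag>...</tag> found, in line order."""
    pattern = re.compile(r"<%s>(.*)</%s>" % (tag, tag))
    # Lines without the tag yield None and are skipped (the FILTER step).
    return [m.group(1) for m in map(pattern.search, lines) if m]


lines = CATALOG.splitlines()
# zip() pairs the n-th title, author and price (the RANK + JOIN step).
rows = list(zip(extract("TITLE", lines),
                extract("AUTHOR", lines),
                extract("PRICE", lines)))
for row in rows:
    print(",".join(row))
```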

Unable to findnodes() restricted just to current parent

I'm parsing a simple XML file to create a flat text file from it. The desired outcome is shown below the sample XML. The XML has sort of a header-detail structure (Assembly_Info and Part respectively), with a unique header node followed by any number of detail record nodes, all of which are siblings. After digging into the elements under the header, I can't then find a way back 'up' to then pick up all the sibling detail nodes.
XML file looks like this:
<?xml version="1.0" standalone="yes" ?>
<Wrapper>
<Record>
<Product>
<prodid>4094</prodid>
</Product>
<Assembly>
<Assembly_Info>
<id>DF-7A</id>
<interface>C</interface>
</Assembly_Info>
<Part>
<status>N/A</status>
<dev_name>0000</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>0455</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>045A</dev_name>
</Part>
</Assembly>
<Assembly>
<Assembly_Info>
<id>DF-7A</id>
<interface>C</interface>
</Assembly_Info>
<Part>
<status>N/A</status>
<dev_name>0002</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>0457</dev_name>
</Part>
</Assembly>
</Record>
</Wrapper>
For each Assembly I need to read the values of the two elements in Assembly_Info, which I do successfully. But I then want to read each of the Part records that are associated with the Assembly. The objective is to 'flatten' the file into this:
prodid id interface status dev_name
4094 DF-7A C N/A 0000
4094 DF-7A C Ready 0455
4094 DF-7A C Ready 045A
4094 DF-7A C N/A 0002
4094 DF-7A C Ready 0457
I'm attempting to use findnodes() to do this, as that's about the only tool I thought I understood. Unfortunately, my code reads all of the Part records from the entire file for each Assembly, since the only way I've been able to find the Part nodes is to start at the root. I don't know how to change 'where I am', if you will, to tell findnodes to begin at the current parent. The code looks like this:
my $parser = XML::LibXML -> new();
my $tree = $parser -> parse_file ('DEMO.XML');
for my $product ($tree->findnodes ('/Wrapper/Record/Product/prodid')) {
$prodid = $product->textContent();
}
foreach my $assembly ($tree->findnodes ('/Wrapper/Record/Assembly')){
$assemblies++;
$parts = 0;
for my $assembly ($tree->findnodes ('/Wrapper/Record/Assembly/Assembly_Info')) {
$id = $assembly->findvalue('id');
$interface = $assembly->findvalue('interface');
}
foreach my $part ($tree->findnodes ('/Wrapper/Record/Assembly/Part')) {
$parts++;
$status = $part->findvalue('status');
$dev_name = $part->findvalue('dev_name');
}
print "Assembly No: ", $assemblies, " Parts: ",$parts, "\n";
}
How do I get just the Part nodes for a given Assembly, after I've gone down to the Assembly_Info depths? There is quite a bit I'm not getting, and I think a problem may be that I'm thinking of this as 'navigating' or moving a cursor, if you will. Examples of XPath path expressions have not helped me.
Instead of always using $tree as the starting point for the findnodes method, you can call it on any other node, including the nodes returned by an earlier findnodes call. A relative XPath expression is then evaluated from that node. For example:
for my $record ($tree->findnodes('/Wrapper/Record')) {
    for my $assembly ($record->findnodes('./Assembly')) {
        for my $part ($assembly->findnodes('./Part')) {
            # $part belongs to the current $assembly only;
            # read it with e.g. $part->findvalue('status')
        }
    }
}
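The same "search relative to the current node" pattern exists in most XML libraries. As a cross-check of the flattening logic, here is a sketch in Python using the standard library's ElementTree rather than XML::LibXML. It assumes the sample file is held in a string, and the function name flatten is made up for illustration.

```python
import xml.etree.ElementTree as ET

DEMO = """<Wrapper>
  <Record>
    <Product><prodid>4094</prodid></Product>
    <Assembly>
      <Assembly_Info><id>DF-7A</id><interface>C</interface></Assembly_Info>
      <Part><status>N/A</status><dev_name>0000</dev_name></Part>
      <Part><status>Ready</status><dev_name>0455</dev_name></Part>
      <Part><status>Ready</status><dev_name>045A</dev_name></Part>
    </Assembly>
    <Assembly>
      <Assembly_Info><id>DF-7A</id><interface>C</interface></Assembly_Info>
      <Part><status>N/A</status><dev_name>0002</dev_name></Part>
      <Part><status>Ready</status><dev_name>0457</dev_name></Part>
    </Assembly>
  </Record>
</Wrapper>"""


def flatten(xml_text):
    """Return one (prodid, id, interface, status, dev_name) tuple per Part."""
    rows = []
    root = ET.fromstring(xml_text)  # root is the <Wrapper> element
    for record in root.findall("Record"):
        prodid = record.findtext("Product/prodid")
        # Each search below starts at the current node, not at the root.
        for assembly in record.findall("Assembly"):
            asm_id = assembly.findtext("Assembly_Info/id")
            interface = assembly.findtext("Assembly_Info/interface")
            for part in assembly.findall("Part"):
                rows.append((prodid, asm_id, interface,
                             part.findtext("status"),
                             part.findtext("dev_name")))
    return rows


for row in flatten(DEMO):
    print(" ".join(row))
```

Run against the sample, this prints the five data rows listed in the question, one per Part.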

libxml2 predicates in xpath expression are not always recognized

I'm having trouble with the libxml2 library: it does not seem to take certain predicates in my XPath expressions into account.
Here is an example of xml file that I am trying to parse:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book title="Harry Potter" lang="eng" version="1">
<price>29.99</price>
</book>
<book title="Learning XML" lang="eng" version="2">
<price>38.95</price>
</book>
<book title="Learning C" lang="eng" version="2">
<price>39.95</price>
</book>
</bookstore>
Suppose I want to extract all the books whose native language is English and whose version is the first edition.
If I'm not mistaken, the XPath expression for that is:
//book[@lang='eng' and @version='1']
and I evaluate it with the following instructions in my code:
xmlChar * xpath_expression = "//book[@lang='eng' and @version='1']";
xmlXPathObjectPtr xpathRes = xmlXPathEvalExpression(xpath_expression, ctxt);
The problem is that the result is the full list of books, as if I had made the following request:
//book
I wonder if my version is buggy knowing that I have the latest for my debian squeeze (2.7.8.dfsg-2 + squeeze7)...
This is most certainly not a bug in libxml2. You probably made an error elsewhere. The following code only prints "Harry Potter":
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

int main(void)
{
    static const char xml[] =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
        "<bookstore>\n"
        "  <book title=\"Harry Potter\" lang=\"eng\" version=\"1\">\n"
        "    <price>29.99</price>\n"
        "  </book>\n"
        "  <book title=\"Learning XML\" lang=\"eng\" version=\"2\">\n"
        "    <price>38.95</price>\n"
        "  </book>\n"
        "  <book title=\"Learning C\" lang=\"eng\" version=\"2\"> \n"
        "    <price>39.95</price>\n"
        "  </book>\n"
        "</bookstore>\n";

    /* sizeof(xml) - 1 keeps the terminating NUL out of the parsed buffer. */
    xmlDocPtr doc = xmlParseMemory(xml, sizeof(xml) - 1);
    xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
    xmlChar *expression = BAD_CAST "//book[@lang='eng' and @version='1']";
    xmlXPathObjectPtr res = xmlXPathEvalExpression(expression, ctxt);
    xmlNodeSetPtr nodeset = res->nodesetval;

    /* nodesetval may be NULL when nothing matches, so check it first. */
    for (int i = 0; nodeset != NULL && i < nodeset->nodeNr; i++) {
        xmlNodePtr node = nodeset->nodeTab[i];
        xmlChar *title = xmlGetProp(node, BAD_CAST "title");
        printf("%s\n", title);
        xmlFree(title); /* xmlGetProp returns a copy the caller must free */
    }

    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctxt);
    xmlFreeDoc(doc);
    return 0;
}
