Distinct Result via xQuery

I'm trying to get reviewers who review one or more books published after 2010.
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
The following are both XML files.
review.xml:
<Reviews>
<Review>
<ReviewID>R1</ReviewID>
<BookTitle>B1</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
<Review>
<ReviewID>R2</ReviewID>
<BookTitle>B1</BookTitle>
<Reviewer>BBB</Reviewer>
</Review>
<Review>
<ReviewID>R3</ReviewID>
<BookTitle>B2</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
<Review>
<ReviewID>R4</ReviewID>
<BookTitle>B3</BookTitle>
<Reviewer>AAA</Reviewer>
</Review>
</Reviews>
book.xml:
<Books>
<Book>
<Title>B1</Title>
<Year>2005</Year>
</Book>
<Book>
<Title>B2</Title>
<Year>2011</Year>
</Book>
<Book>
<Title>B3</Title>
<Year>2012</Year>
</Book>
</Books>
My XQuery code returns AAA twice. Is there a way to get a distinct result, meaning only one AAA? I've tried distinct-values() but don't know how to use it properly. Thanks for your reply!
----My Updated Solution with XML format for xQuery 1.0----
<root>
{
for $x in distinct-values
(
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
)
return <reviewer>{$x}</reviewer>
}
</root>

To preserve nodes, you can use the "group by" clause (available in XQuery 3.0 and later) and select the first item of each group's sequence:
for $r in doc("review.xml")//Review,
$b in doc("book.xml")//Book
let $n := $r/Reviewer
where $b/Title = $r/BookTitle
and $b/Year > 2010
group by $n
return $r[1]/Reviewer

The following query will give you all distinct reviewer names (note that the values are atomized, which means you get plain values rather than element nodes):
distinct-values(
for $r in doc("review.xml")//Reviews//Review,
$b in doc("book.xml")//Books//Book
where $b/Title = $r/BookTitle
and $b/Year > 2010
return $r/Reviewer
)
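
As a cross-check outside an XQuery processor, the same join-then-deduplicate logic can be sketched in Python. This is only an illustration under assumptions: the function name distinct_reviewers and the inlined sample data are hypothetical, and the standard library's ElementTree stands in for doc().

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (inlined instead of reading files).
REVIEWS = """<Reviews>
  <Review><ReviewID>R1</ReviewID><BookTitle>B1</BookTitle><Reviewer>AAA</Reviewer></Review>
  <Review><ReviewID>R2</ReviewID><BookTitle>B1</BookTitle><Reviewer>BBB</Reviewer></Review>
  <Review><ReviewID>R3</ReviewID><BookTitle>B2</BookTitle><Reviewer>AAA</Reviewer></Review>
  <Review><ReviewID>R4</ReviewID><BookTitle>B3</BookTitle><Reviewer>AAA</Reviewer></Review>
</Reviews>"""

BOOKS = """<Books>
  <Book><Title>B1</Title><Year>2005</Year></Book>
  <Book><Title>B2</Title><Year>2011</Year></Book>
  <Book><Title>B3</Title><Year>2012</Year></Book>
</Books>"""


def distinct_reviewers(reviews_xml, books_xml, after_year):
    """Return reviewer names (first-seen order) with a review of a book newer than after_year."""
    books = ET.fromstring(books_xml)
    reviews = ET.fromstring(reviews_xml)
    # Titles of all books published after the cut-off year.
    recent = {b.findtext("Title") for b in books.iter("Book")
              if int(b.findtext("Year")) > after_year}
    names = []
    for review in reviews.iter("Review"):
        name = review.findtext("Reviewer")
        # Keep each name once, like distinct-values().
        if review.findtext("BookTitle") in recent and name not in names:
            names.append(name)
    return names


print(distinct_reviewers(REVIEWS, BOOKS, 2010))  # ['AAA']
```

Like distinct-values(), this returns plain strings rather than element nodes.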

Related

how to read the data from XML with spaces using oracle

I want to read the data from passage_para tag, after passage_para I have 2 spaces before the expression tag and after the expression tag I have one more space, etc. When I use extract function to get the passage_para tag from the XMLTYPE column it is eliminating all the spaces.
<?xml version="1.0" encoding="UTF-8"?> <item> <information number="sdjsadh" > <response_direction delivery_mode="xcs"> <dparagraph>test</dparagraph> </response_direction> </information> <i_content> <stimulus_reference> <passage> <prose style="1"> <passage_para> <expression> <math xmlns="Math" xmlns:xlink="xlink" display="inline" overflow="scroll"> <mr> <mi>z</mi> <mo>></mo> <mn>0</mn> </mr> </math> </expression> </passage_para> </prose> </passage> </stimulus_reference> </i_content> </item>
which I don't want because it is taking out the spaces. The desired output I need is " z > 0 ".
Note: Between the passage_para tag the child nodes may change, they are not going to be the same.

How to add the values for respective elements?

I have 3 XML structures as below:
a.xml
<Books>
<Book>
<Publisher>ABC Pvt Ltd</Publisher>
<Month>May</Month>
<Year>2016</Year>
<BooksReleased>4</BooksReleased>
</Book>
</Books>
b.xml
<Books>
<Book>
<Publisher>XYZ Pvt Ltd</Publisher>
<Month>April</Month>
<Year>2016</Year>
<BooksReleased>2</BooksReleased>
</Book>
</Books>
c.xml
<Books>
<Book>
<Publisher>ABC Pvt Ltd</Publisher>
<Month>June</Month>
<Year>2016</Year>
<BooksReleased>2</BooksReleased>
</Book>
</Books>
I would like to group these XML by publisher and also need to calculate its total no. of BooksReleased by the publisher for particular year.
required output format:
<TotalCalc>
<PublishedBook>
<Publisher>ABC Pvt Ltd</Publisher>
<no.of books>6</no.of books>
</PublishedBook>
<PublishedBook>
<Publisher>XYZ Pvt Ltd</Publisher>
<no.of books>2</no.of books>
</PublishedBook>
</TotalCalc>
Kindly help me. I tried the following, but it's not working:
typeswitch($Publisher)
case element (ABC Pvt Ltd)
return sum($doc/BooksReleases[$doc/$Publisher = 'ABC Pvt Ltd'])
default return 'unknown'

It might be possible to use cts:value-tuples to pull up co-occurrences of Publisher and 'BooksReleased', which you can then iterate to aggregate by Publisher. That would scale much better. Something like:
let $aggregates := map:map()
let $_ :=
for $tuple in cts:value-tuples((
cts:element-reference(xs:QName("Publisher")),
cts:element-reference(xs:QName("BooksReleased"))
))
let $values := json:array-values($tuple)
let $pub := $values[1]
let $books as xs:int := $values[2]
return map:put($aggregates, $pub, (map:get($aggregates, $pub), 0)[1] + $books)
return $aggregates
Note, though, that this requires indexes on Publisher and BooksReleased, and it is important that each document contains only one Publisher value to prevent cross-products.
I would also consider simply dropping (or ignoring) BooksReleased, and just making sure you save each book as a separate document. You can then use cts:values on Publisher and use cts:frequency on each publisher value to get the number of books for the publishers.
HTH!
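
For comparison, the aggregation itself (group by Publisher, sum BooksReleased for one year) can be sketched in plain Python. This is not MarkLogic code and does not use any index; the function name total_by_publisher and the inlined documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The three documents from the question, inlined as strings.
DOCS = [
    "<Books><Book><Publisher>ABC Pvt Ltd</Publisher><Month>May</Month>"
    "<Year>2016</Year><BooksReleased>4</BooksReleased></Book></Books>",
    "<Books><Book><Publisher>XYZ Pvt Ltd</Publisher><Month>April</Month>"
    "<Year>2016</Year><BooksReleased>2</BooksReleased></Book></Books>",
    "<Books><Book><Publisher>ABC Pvt Ltd</Publisher><Month>June</Month>"
    "<Year>2016</Year><BooksReleased>2</BooksReleased></Book></Books>",
]


def total_by_publisher(docs, year):
    """Sum BooksReleased per Publisher over all Book elements of the given year."""
    totals = defaultdict(int)
    for text in docs:
        for book in ET.fromstring(text).iter("Book"):
            if book.findtext("Year") == year:
                totals[book.findtext("Publisher")] += int(book.findtext("BooksReleased"))
    return dict(totals)


print(total_by_publisher(DOCS, "2016"))
# {'ABC Pvt Ltd': 6, 'XYZ Pvt Ltd': 2}
```

This scans every document, so it illustrates the expected totals rather than the scalable index-based approach described above.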

I want to extract specific information from XML input using Hadoop Pig Latin

Expected output is: (Hadoop definitive guide,Tom white,24.90).
I have tried using the Regex_Extract() function. But, no luck yet. Can someone please help me out?
The input to my script is:
<CATALOG>
<BOOK>
<TITLE>Hadoop DEFINITIVE GUIDE</TITLE>
<AUTHOR>TOM WHITE</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>

You will have to extract <TITLE>, <AUTHOR> and <PRICE> separately and then join them together using the JOIN operator. RANK numbers the rows of each extracted relation, and those numbers serve as the join key.
The following script achieves that:
-- Load input
A = LOAD '/input.txt' USING PigStorage() AS (f1:chararray);
-- Extract <TITLE>
B1 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<TITLE>(.*)</TITLE>', 1) AS (title:chararray);
C1 = FILTER B1 BY title is not null;
D1 = RANK C1;
-- Extract <AUTHOR>
B2 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<AUTHOR>(.*)</AUTHOR>', 1) AS (author:chararray);
C2 = FILTER B2 BY author is not null;
D2 = RANK C2;
-- Extract <PRICE>
B3 = FOREACH A GENERATE REGEX_EXTRACT(f1, '<PRICE>(.*)</PRICE>', 1) AS (price:chararray);
C3 = FILTER B3 BY price is not null;
D3 = RANK C3;
-- Join 3 data sets
D = JOIN D1 BY $0, D2 BY $0, D3 By $0;
-- Eliminate the ranks
E = FOREACH D GENERATE $1 AS (title:chararray), $3 AS (author:chararray), $5 AS (price:chararray);
dump E;
For the input mentioned in the question, I got the following output:
(Hadoop DEFINITIVE GUIDE,TOM WHITE,24.90)
(Programming Pig,Alan Gates,30.90)
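
The extract-then-pair-by-position idea can also be sketched in Python for a quick local check. This is only an illustration, not Pig: the function name extract_books is hypothetical, and the sketch assumes, as the script above does, that every BOOK has exactly one TITLE, AUTHOR and PRICE, each on its own line.

```python
import re

# Input from the question, inlined as a string.
CATALOG = """<CATALOG>
<BOOK>
<TITLE>Hadoop DEFINITIVE GUIDE</TITLE>
<AUTHOR>TOM WHITE</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>"""


def extract_books(text):
    """Extract each tag separately, then pair the results by position."""
    # '.' does not match newlines, so each pattern stays within one line.
    titles = re.findall(r"<TITLE>(.*)</TITLE>", text)
    authors = re.findall(r"<AUTHOR>(.*)</AUTHOR>", text)
    prices = re.findall(r"<PRICE>(.*)</PRICE>", text)
    # zip plays the role of RANK + JOIN: the n-th title meets the n-th author and price.
    return list(zip(titles, authors, prices))


for row in extract_books(CATALOG):
    print(row)
```

If a book were missing one of the three tags, positional pairing would silently misalign the rows, which is a limitation this sketch shares with the RANK-based join.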

Unable to findnodes() restricted just to current parent

I'm parsing a simple XML file to create a flat text file from it. The desired outcome is shown below the sample XML. The XML has sort of a header-detail structure (Assembly_Info and Part respectively), with a unique header node followed by any number of detail record nodes, all of which are siblings. After digging into the elements under the header, I can't then find a way back 'up' to then pick up all the sibling detail nodes.
XML file looks like this:
<?xml version="1.0" standalone="yes" ?>
<Wrapper>
<Record>
<Product>
<prodid>4094</prodid>
</Product>
<Assembly>
<Assembly_Info>
<id>DF-7A</id>
<interface>C</interface>
</Assembly_Info>
<Part>
<status>N/A</status>
<dev_name>0000</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>0455</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>045A</dev_name>
</Part>
</Assembly>
<Assembly>
<Assembly_Info>
<id>DF-7A</id>
<interface>C</interface>
</Assembly_Info>
<Part>
<status>N/A</status>
<dev_name>0002</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>0457</dev_name>
</Part>
</Assembly>
</Record>
</Wrapper>
For each Assembly I need to read the values of the two elements in Assembly_Info, which I do successfully. But I then want to read each of the Part records that are associated with the Assembly. The objective is to 'flatten' the file into this:
prodid id interface status dev_name
4094 DF-7A C N/A 0000
4094 DF-7A C Ready 0455
4094 DF-7A C Ready 045A
4094 DF-7A C N/A 0002
4094 DF-7A C Ready 0457
I'm attempting to use findnodes() to do this, as that's about the only tool I thought I understood. Unfortunately, my code reads all of the Part records from the entire file for each Assembly, since the only way I've been able to find the Part nodes is to start at the root. I don't know how to change 'where I am', if you will, and tell findnodes to begin at the current parent. The code looks like this:
my $parser = XML::LibXML -> new();
my $tree = $parser -> parse_file ('DEMO.XML');
for my $product ($tree->findnodes ('/Wrapper/Record/Product/prodid')) {
$prodid = $product->textContent();
}
foreach my $assembly ($tree->findnodes ('/Wrapper/Record/Assembly')){
$assemblies++;
$parts = 0;
for my $assembly ($tree->findnodes ('/Wrapper/Record/Assembly/Assembly_Info')) {
$id = $assembly->findvalue('id');
$interface = $assembly->findvalue('interface');
}
foreach my $part ($tree->findnodes ('/Wrapper/Record/Assembly/Part')) {
$parts++;
$status = $part->findvalue('status');
$dev_name = $part->findvalue('dev_name');
}
print "Assembly No: ", $assemblies, " Parts: ",$parts, "\n";
}
How do I get just the Part nodes for a given Assembly, after I've gone down to the Assembly_Info depths? There is quite a bit I'm not getting, and I think a problem may be that I'm thinking of this as 'navigating' or moving a cursor, if you will. Examples of XPath path expressions have not helped me.

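Instead of always using $tree as the starting point for the findnodes method, you can call it on any other node, including the nodes returned by an earlier findnodes call. A relative XPath expression is then evaluated from that node. For example:
for my $record ($tree->findnodes('/Wrapper/Record')) {
    my $prodid = $record->findvalue('./Product/prodid');
    for my $assembly ($record->findnodes('./Assembly')) {
        my $id        = $assembly->findvalue('./Assembly_Info/id');
        my $interface = $assembly->findvalue('./Assembly_Info/interface');
        # Relative path: only the Part children of this Assembly
        for my $part ($assembly->findnodes('./Part')) {
            print join(' ', $prodid, $id, $interface,
                $part->findvalue('./status'), $part->findvalue('./dev_name')), "\n";
        }
    }
}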
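
The same "search relative to the current node" idea, sketched in Python for comparison. The function name flatten and the inlined sample document are assumptions for illustration; the standard library's ElementTree evaluates findall and findtext relative to the element they are called on.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (XML declaration omitted for brevity).
DEMO = """<Wrapper>
  <Record>
    <Product><prodid>4094</prodid></Product>
    <Assembly>
      <Assembly_Info><id>DF-7A</id><interface>C</interface></Assembly_Info>
      <Part><status>N/A</status><dev_name>0000</dev_name></Part>
      <Part><status>Ready</status><dev_name>0455</dev_name></Part>
      <Part><status>Ready</status><dev_name>045A</dev_name></Part>
    </Assembly>
    <Assembly>
      <Assembly_Info><id>DF-7A</id><interface>C</interface></Assembly_Info>
      <Part><status>N/A</status><dev_name>0002</dev_name></Part>
      <Part><status>Ready</status><dev_name>0457</dev_name></Part>
    </Assembly>
  </Record>
</Wrapper>"""


def flatten(xml_text):
    """Return one (prodid, id, interface, status, dev_name) tuple per Part."""
    root = ET.fromstring(xml_text)  # root is the <Wrapper> element
    rows = []
    for record in root.findall("Record"):
        prodid = record.findtext("Product/prodid")
        for assembly in record.findall("Assembly"):
            # Both lookups start at this Assembly, not at the document root.
            asm_id = assembly.findtext("Assembly_Info/id")
            interface = assembly.findtext("Assembly_Info/interface")
            for part in assembly.findall("Part"):
                rows.append((prodid, asm_id, interface,
                             part.findtext("status"), part.findtext("dev_name")))
    return rows


for row in flatten(DEMO):
    print(" ".join(row))
```

Each inner loop only sees the children of the element it was called on, so the Part rows stay attached to their own Assembly.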

libxml2 predicates in xpath expression are not always recognized

I'm having trouble with the libxml2 library: it does not seem to take certain predicates in my XPath expressions into account.
Here is an example of xml file that I am trying to parse:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book title="Harry Potter" lang="eng" version="1">
<price>29.99</price>
</book>
<book title="Learning XML" lang="eng" version="2">
<price>38.95</price>
</book>
<book title="Learning C" lang="eng" version="2">
<price>39.95</price>
</book>
</bookstore>
Suppose I want to extract all the books whose native language is English and whose version is the first edition.
If I'm not mistaken, I should use the following XPath expression:
//book[@lang='eng' and @version='1']
and the following instructions in my code:
xmlChar * xpath_expression = "//book[@lang='eng' and @version='1']";
xmlXPathObjectPtr xpathRes = xmlXPathEvalExpression(xpath_expression, ctxt);
The problem is that I get as a result, the list of books as if I'd just do the following request:
//book
I wonder if my version is buggy knowing that I have the latest for my debian squeeze (2.7.8.dfsg-2 + squeeze7)...

This is most certainly not a bug in libxml2. You probably made an error elsewhere. The following code only prints "Harry Potter":
#include <stdio.h>
#include <libxml/parser.h> /* declares xmlParseMemory */
#include <libxml/xpath.h>
int main()
{
static const char xml[] =
"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
"<bookstore>\n"
" <book title=\"Harry Potter\" lang=\"eng\" version=\"1\">\n"
" <price>29.99</price>\n"
" </book>\n"
" <book title=\"Learning XML\" lang=\"eng\" version=\"2\">\n"
" <price>38.95</price>\n"
" </book>\n"
" <book title=\"Learning C\" lang=\"eng\" version=\"2\"> \n"
" <price>39.95</price>\n"
" </book>\n"
"</bookstore>\n";
/* sizeof(xml) - 1 excludes the string's terminating NUL byte */
xmlDocPtr doc = xmlParseMemory(xml, sizeof(xml) - 1);
xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
xmlChar *expression = BAD_CAST "//book[@lang='eng' and @version='1']";
xmlXPathObjectPtr res = xmlXPathEvalExpression(expression, ctxt);
xmlNodeSetPtr nodeset = res->nodesetval;
for (int i = 0; i < nodeset->nodeNr; i++) {
xmlNodePtr node = nodeset->nodeTab[i];
xmlChar *title = xmlGetProp(node, BAD_CAST "title");
printf("%s\n", title);
xmlFree(title); /* xmlGetProp returns a copy that the caller must free */
}
xmlXPathFreeObject(res);
xmlXPathFreeContext(ctxt);
xmlFreeDoc(doc);
return 0;
}
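
To sanity-check the predicate independently of libxml2, here is a minimal Python sketch. The function name titles is hypothetical, and the sketch assumes the standard library's ElementTree, whose limited XPath subset has no "and" operator, so the two attribute tests are written as chained predicates instead.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (XML declaration omitted for brevity).
BOOKSTORE = """<bookstore>
  <book title="Harry Potter" lang="eng" version="1"><price>29.99</price></book>
  <book title="Learning XML" lang="eng" version="2"><price>38.95</price></book>
  <book title="Learning C" lang="eng" version="2"><price>39.95</price></book>
</bookstore>"""


def titles(xml_text, lang, version):
    """Return the title attribute of every book matching both attribute values."""
    root = ET.fromstring(xml_text)
    # Chained predicates: a book must satisfy both to be selected.
    path = f".//book[@lang='{lang}'][@version='{version}']"
    return [book.get("title") for book in root.findall(path)]


print(titles(BOOKSTORE, "eng", "1"))  # ['Harry Potter']
```

If the predicates were ignored, all three titles would be returned, which is the symptom described in the question.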
