XPath for stackoverflow dump files - xpath

I am working with a file in the following format:
<badges>
<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>
<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>
</badges>
I am using Pig's XMLLoader to load this XML data from HDFS:
A = LOAD '/badges' using org.apache.pig.piggybank.storage.XMLLoader('row') as (x:chararray);
B = foreach A generate xpath(x, '/row@Id');
Dump B;
The output I get is () with no values.
I want the output as text, i.e. 1,1,Teacher,2009-09-30T15:17:50.66. How can I do this?

I'm not familiar with Pig's XMLLoader, but /row@Id has two problems:
It's not valid XPath.
If it were, it would be an absolute path.
Try:
B = foreach A generate xpath(x, 'row/@Id');
It uses valid syntax and a relative path.

Use XPathAll to extract attributes. XPath has an issue when it comes to attributes.
REGISTER '/path/piggybank-0.15.0.jar'; -- Use the jar name you downloaded
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
B = foreach A generate XPathAll(x, 'row/@Id', true, false).$0 as (id:chararray);
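To cross-check what the extraction should produce, here is a minimal Python sketch that runs outside Pig. It assumes each record arrives as a standalone <row .../> fragment, as in the sample above; the names FIELDS and row_to_csv are made up for this illustration and are not part of Pig or piggybank.

```python
import xml.etree.ElementTree as ET

# Each string stands in for one record, copied from the sample in the question.
rows = [
    '<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>',
    '<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>',
]

# Attribute order wanted in the text output (hypothetical helper constant).
FIELDS = ("Id", "UserId", "Name", "Date")


def row_to_csv(fragment: str) -> str:
    """Parse one <row .../> fragment and join the chosen attributes with commas."""
    element = ET.fromstring(fragment)
    # .get() falls back to an empty string when an attribute is missing.
    return ",".join(element.get(name, "") for name in FIELDS)


lines = [row_to_csv(r) for r in rows]
for line in lines:
    print(line)
# 1,1,Teacher,2009-09-30T15:17:50.66
# 2,3,Teacher,2009-09-30T15:17:50.69
```

If the Pig job returns something different for the same rows, the XPath expression or the UDF is the likely culprit rather than the data.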

Related

Why is the header automatically skipped in the output file?

I want to store my data without losing the data header.
This is my Pig script:
CRE_GM05 = LOAD '$input1' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,T32_001:chararray,TEC_013:chararray,TEC_014:chararray,DAT_001_X:chararray,DAT_002_X:chararray,TEC_001:chararray);
CRE_GM11 = LOAD '$input2' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,DAT_001_X:chararray,DAT_002_X:chararray,D08_001:chararray,PSE_001:chararray,PSE_002:chararray,PSE_003:chararray,RUB_001:chararray,RUB_002:chararray,RUB_003:chararray,RUB_004:chararray,RUB_005:chararray,RUB_006:chararray,RUB_007:chararray,RUB_008:chararray,RUB_009:chararray,RUB_010:chararray,TEC_001:chararray,TEC_002:chararray,TEC_003:chararray,TX_001_VLR:chararray,TX_001_DCM:chararray,D08_004:chararray,D11_004:chararray,RUB_016:chararray,T03_001:chararray);
-- Join the two tables
JOINED_TABLES = JOIN CRE_GM05 BY TEC_001, CRE_GM11 BY TEC_001;
-- Generate the columns
DATA_GM05 = FOREACH JOINED_TABLES GENERATE
CRE_GM05::MGM_COMPTEUR AS MGM_COMPTEUR,
CRE_GM05::CIA_CD_CRV_CIA AS CIA_CD_CRV_CIA,
CRE_GM05::CIA_DA_EM_CRV AS CIA_DA_EM_CRV,
CRE_GM05::CIA_CD_CTRL_BLCE AS CIA_CD_CTRL_BLCE,
CRE_GM05::CIA_IDC_EXTR_RDJ AS CIA_IDC_EXTR_RDJ,
CRE_GM05::CIA_VLR_IDT_CRV_LOQ AS CIA_VLR_IDT_CRV_LOQ,
CRE_GM05::CIA_VLR_REF_CRV AS CIA_VLR_REF_CRV,
CRE_GM05::CIA_VLR_LG_ZON_RTG AS CIA_VLR_LG_ZON_RTG,
CRE_GM05::CIA_HEU_CIA AS CIA_HEU_CIA,
CRE_GM05::CIA_TM_STP_CRE AS CIA_TM_STP_CRE,
CRE_GM05::CIA_VLR_1 AS CIA_VLR_1,
CRE_GM05::CIA_DA_ARR_FIC AS CIA_DA_ARR_FIC,
CRE_GM05::CIA_TY_ENR AS CIA_TY_ENR,
CRE_GM05::CIA_CD_BTE AS CIA_CD_BTE,
CRE_GM05::CIA_CD_PER AS CIA_CD_PER,
CRE_GM05::CIA_CD_EFS AS CIA_CD_EFS,
CRE_GM05::CIA_CD_ETA_VAL_CRV AS CIA_CD_ETA_VAL_CRV,
CRE_GM05::CIA_CD_EVE_CPR AS CIA_CD_EVE_CPR,
CRE_GM05::CIA_CD_APLI_TDU AS CIA_CD_APLI_TDU,
CRE_GM05::CIA_CD_STE_RTG AS CIA_CD_STE_RTG,
CRE_GM05::CIA_DA_TT_RTG AS CIA_DA_TT_RTG,
CRE_GM05::CIA_NO_ENR_RTG AS CIA_NO_ENR_RTG,
CRE_GM05::CIA_DA_VAL_EVE AS CIA_DA_VAL_EVE,
CRE_GM05::T32_001 AS T32_001,
CRE_GM05::TEC_013 AS TEC_013,
CRE_GM05::TEC_014 AS TEC_014,
CRE_GM05::DAT_001_X AS DAT_001_X,
CRE_GM05::DAT_002_X AS DAT_002_X,
CRE_GM05::TEC_001 AS TEC_001;
STORE DATA_GM05 INTO '$OUTPUT_FILE' USING PigStorage(';');
It returns data, but I lose the first line, which holds the headers.
Note that my $input1 and $input2 variables are CSV files.
I tried using CSVLoader, but that doesn't work either.
I need the output to be stored with headers, please.
By default, Pig does not write headers to its final output. Adding a header to the final output also does not make much sense, because the order of rows in Pig output is not fixed.
If you want a header in the final output, either merge all the part files into a single file on the local file system, where you can add the header explicitly, or store the output of this Pig script in a Hive table. HCatalog's storer (HCatStorer) can be used for that.
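The first option (merge the part files locally and add the header yourself) could look like the sketch below. It assumes the part-* files have already been copied to a local directory, for example with hadoop fs -get; the function name merge_with_header, the file names, and the two-column header are hypothetical and only used for the demo.

```python
import glob
import os
import tempfile


def merge_with_header(part_dir: str, out_path: str, header: str) -> int:
    """Write the header, then every line of every part-* file.

    Returns the number of data lines copied.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(header + "\n")
        # Sorting keeps the part files in a stable order between runs.
        for part in sorted(glob.glob(os.path.join(part_dir, "part-*"))):
            with open(part, encoding="utf-8") as src:
                for line in src:
                    # Make sure every record ends with a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    count += 1
    return count


# Demo with two fake part files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in (("part-r-00000", "1;a\n"), ("part-r-00001", "2;b\n")):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    merged = os.path.join(tmp, "merged.csv")
    n = merge_with_header(tmp, merged, "MGM_COMPTEUR;CIA_CD_CRV_CIA")
    with open(merged, encoding="utf-8") as f:
        result = f.read()

print(result)
```

hadoop fs -getmerge can do the concatenation step on its own; the header line still has to be added separately.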

Stanford NLP Coref Resolution for Conversational Data

I want to run some experiments with the Stanford dcoref package on our conversational data. Our data contains usernames (speakers) and their utterances. Is it possible to give structured data as input (instead of raw text) to the Stanford dcoref annotator? If so, what should the format of the conversational input data be?
Thank you,
-berfin
I was able to get this basic example to work:
<doc id="speaker-example-1">
<post author="Joe Smith" datetime="2018-02-28T20:10:00" id="p1">
I am hungry!
</post>
<post author="Jane Smith" datetime="2018-02-28T20:10:05" id="p2">
Joe Smith is hungry.
</post>
</doc>
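If your conversations are stored as (speaker, timestamp, utterance) records, a document in the layout above can be generated with a few lines of Python. This is only a sketch: build_doc and the tuple layout are made up for illustration, and the tag and attribute names are copied from the example above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_doc(doc_id, turns):
    """Build a <doc> with one <post> per turn.

    turns: iterable of (author, datetime, text) tuples in conversation order.
    """
    lines = [f"<doc id={quoteattr(doc_id)}>"]
    for i, (author, when, text) in enumerate(turns, start=1):
        post_id = f"p{i}"
        # quoteattr adds the surrounding quotes and escapes special characters.
        lines.append(
            f"<post author={quoteattr(author)} "
            f"datetime={quoteattr(when)} id={quoteattr(post_id)}>"
        )
        lines.append(escape(text))
        lines.append("</post>")
    lines.append("</doc>")
    return "\n".join(lines)


turns = [
    ("Joe Smith", "2018-02-28T20:10:00", "I am hungry!"),
    ("Jane Smith", "2018-02-28T20:10:05", "Joe Smith is hungry."),
]
xml_text = build_doc("speaker-example-1", turns)
ET.fromstring(xml_text)  # raises ParseError if the result is not well-formed
print(xml_text)
```

Escaping matters here because usernames and utterances in chat data often contain characters such as & or <.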
I used these properties:
annotators = tokenize,cleanxml,ssplit,pos,lemma,ner,parse,coref
coref.conll = true
coref.algorithm = clustering
# Clean XML tags for SGM (move to sgm specific conf file?)
clean.xmltags = headline|dateline|text|post
clean.singlesentencetags = HEADLINE|DATELINE|SPEAKER|POSTER|POSTDATE
clean.sentenceendingtags = P|POST|QUOTE
clean.turntags = TURN|POST|QUOTE
clean.speakertags = SPEAKER|POSTER
clean.docIdtags = DOCID
clean.datetags = DATETIME|DATE|DATELINE
clean.doctypetags = DOCTYPE
clean.docAnnotations = docID=doc[id],doctype=doc[type],docsourcetype=doctype[source]
clean.sectiontags = HEADLINE|DATELINE|POST
clean.sectionAnnotations = sectionID=post[id],sectionDate=post[date|datetime],sectionDate=postdate,author=post[author],author=poster
clean.quotetags = quote
clean.quoteauthorattributes = orig_author
clean.tokenAnnotations = link=a[href],speaker=post[author],speaker=quote[orig_author]
clean.ssplitDiscardTokens = \\n|\\*NL\\*
Also this document has great info on the coref system:
https://stanfordnlp.github.io/CoreNLP/coref.html
I am looking into using the neural option on my example .xml document, but you might have to put your data into the CoNLL format to run our neural coref with the CoNLL settings. The CoNLL data includes conversational data with speaker info, among other document formats.
This document describes the CoNLL format you'd have to use for the neural algorithm to work.
CoNLL 2012 format: http://conll.cemantix.org/2012/data.html
You need to create a folder with a similar directory structure (but you can put your own files in it instead).
Example:
/Path/to/conll_2012_dir/v9/data/test/data/english/annotations/wb/eng/00/eng_0009.v9_auto_conll
If you run this command:
java -Xmx20g edu.stanford.nlp.coref.CorefSystem -props speaker.properties
with these properties:
coref.algorithm = clustering
coref.conll = true
coref.conllOutputPath = /Path/to/output_dir
coref.data = /Path/to/conll_2012_dir
it will write CoNLL output files to /Path/to/output_dir.
That command should read in all files ending with _auto_conll.

How to process a multi-delimiter file in Pig 0.8

I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written the following Pig code:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work and gives the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following link but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me get past this error.
Thanks.
Solution for your input example:
LOAD as comma-separated, then STRSPLIT by ';' and FLATTEN.
I finally got a solution.
Here is my solution:
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);
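As a quick check outside Pig of which four fields the sample records should yield, the same split can be sketched in Python. The names DELIMS and parse are hypothetical; the records are the two lines from the question.

```python
import re

records = ["1,Mical,2000;10", "2,Smith,3000;20"]

# One character class covers both delimiters, so a single split is enough.
DELIMS = re.compile(r"[,;]")


def parse(line: str):
    """Split one record on ',' or ';' and convert the numeric fields."""
    empid, ename, sal, deptno = DELIMS.split(line)
    return int(empid), ename, int(sal), int(deptno)


parsed = [parse(r) for r in records]
print(parsed)
# [(1, 'Mical', 2000, 10), (2, 'Smith', 3000, 20)]
```

The Pig solution above reaches the same result in two steps: PigStorage(',') handles the commas, and the regex with the escaped semicolon handles the last field.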

How to extract xml attributes using Xpath in Pig?

I want to extract attributes from an XML file using Pig Latin.
This is a sample of the XML file:
<CATALOG>
<BOOK>
<TITLE test="test1">Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
</CATALOG>
I used this script but it didn't work:
REGISTER ./piggybank.jar
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A = LOAD './books.xml' using org.apache.pig.piggybank.storage.XMLLoader('BOOK') as (x:chararray);
B = FOREACH A GENERATE XPath(x, 'BOOK/TITLE/@test'), XPath(x, 'BOOK/PRICE');
dump B;
The output was:
(,24.90)
I hope someone can help me with this.
Thanks.
There are two bugs in piggybank's XPath class:
The ignoreNamespace logic breaks searching for XML attributes:
https://issues.apache.org/jira/browse/PIG-4751
The ignoreNamespace parameter defaults to true and cannot be overridden:
https://issues.apache.org/jira/browse/PIG-4752
Here is my workaround using XPathAll:
XPathAll(x, 'BOOK/TITLE/@test', true, false).$0 as (test:chararray)
Also, if you still need to ignore namespaces:
XPathAll(x, '//*[local-name()=\'BOOK\']//*[local-name()=\'TITLE\']/@test', true, false).$0 as (test:chararray)

How to have a file as schema for pig?

I have the script below.
My ultimate goal is to use the file it produces as the schema for another input file. Is this possible? The current script looks like this:
%declare INPUT '$input'
%declare SCHEMA '$schema'
%declare OUTPUT '$output'
%declare DEL '$del'
%declare COL ':'
%declare COM ','
A = LOAD '$SCHEMA' using PigStorage('$DEL') AS (field:chararray, dataType:chararray, flag:chararray, chars:chararray);
B = FOREACH A GENERATE CONCAT(field,CONCAT('$COL',CONCAT(REPLACE(REPLACE(dataType, 'decimal','double'), 'string', 'chararray'),'$COM')));
rmf $OUTPUT
STORE B INTO '$OUTPUT';
I'm not sure this is the right approach.
Here is the output:
record_id:chararray,
offer_id:double,
decision_id:double,
offer_type_cd:integer,
promo_id:double,
pymt_method_type_cd:double,
cs_result_id:double,
cs_result_usage_type_cd:double,
rate_index_type_cd:double,
sub_product_id:double,
campaign_id:double,
market_cell_id:double,
assigned_offer_id:chararray,
accepted_offer_flag:chararray,
current_offer_flag:chararray,
offer_good_until_date:chararray,
Of course. You can use Pig's run command to run the script and do what you need. For more explanation of how to do this, refer to this link.
Hope it helps!
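Another option is to build the schema string outside Pig and hand it to the second script as a parameter. The sketch below is only an illustration under several assumptions: the schema file is taken to be pipe-delimited with the field name and data type in the first two columns, the sample rows are invented, and the integer-to-int mapping is added here because integer is not a Pig type name. TYPE_MAP and pig_schema are hypothetical names.

```python
# decimal/string mappings come from the script in the question;
# integer -> int is an assumption added for this sketch.
TYPE_MAP = {"decimal": "double", "string": "chararray", "integer": "int"}


def pig_schema(lines, delimiter="|"):
    """Turn 'field|dataType|...' lines into a 'name:type, name:type' string."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, data_type = line.split(delimiter)[:2]
        # Unknown types are passed through unchanged.
        fields.append(f"{name}:{TYPE_MAP.get(data_type, data_type)}")
    return ", ".join(fields)


# Invented sample rows; the real schema file is not shown in the question.
sample = [
    "record_id|string|Y|10",
    "offer_id|decimal|N|",
    "offer_type_cd|integer|N|",
]
schema = pig_schema(sample)
print(f"AS ({schema})")
# AS (record_id:chararray, offer_id:double, offer_type_cd:int)
```

Unlike the STORE output shown in the question, this produces a single line without a trailing comma, which is the shape an AS clause expects.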
