Why is the header automatically skipped in the output file? - hadoop

I want to store my data without skipping the header row.
This is my Pig script:
CRE_GM05 = LOAD '$input1' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,T32_001:chararray,TEC_013:chararray,TEC_014:chararray,DAT_001_X:chararray,DAT_002_X:chararray,TEC_001:chararray);
CRE_GM11 = LOAD '$input2' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,DAT_001_X:chararray,DAT_002_X:chararray,D08_001:chararray,PSE_001:chararray,PSE_002:chararray,PSE_003:chararray,RUB_001:chararray,RUB_002:chararray,RUB_003:chararray,RUB_004:chararray,RUB_005:chararray,RUB_006:chararray,RUB_007:chararray,RUB_008:chararray,RUB_009:chararray,RUB_010:chararray,TEC_001:chararray,TEC_002:chararray,TEC_003:chararray,TX_001_VLR:chararray,TX_001_DCM:chararray,D08_004:chararray,D11_004:chararray,RUB_016:chararray,T03_001:chararray);
-- Join the two tables
JOINED_TABLES = JOIN CRE_GM05 BY TEC_001, CRE_GM11 BY TEC_001;
-- Generate the columns
DATA_GM05 = FOREACH JOINED_TABLES GENERATE
CRE_GM05::MGM_COMPTEUR AS MGM_COMPTEUR,
CRE_GM05::CIA_CD_CRV_CIA AS CIA_CD_CRV_CIA,
CRE_GM05::CIA_DA_EM_CRV AS CIA_DA_EM_CRV,
CRE_GM05::CIA_CD_CTRL_BLCE AS CIA_CD_CTRL_BLCE,
CRE_GM05::CIA_IDC_EXTR_RDJ AS CIA_IDC_EXTR_RDJ,
CRE_GM05::CIA_VLR_IDT_CRV_LOQ AS CIA_VLR_IDT_CRV_LOQ,
CRE_GM05::CIA_VLR_REF_CRV AS CIA_VLR_REF_CRV,
CRE_GM05::CIA_VLR_LG_ZON_RTG AS CIA_VLR_LG_ZON_RTG,
CRE_GM05::CIA_HEU_CIA AS CIA_HEU_CIA,
CRE_GM05::CIA_TM_STP_CRE AS CIA_TM_STP_CRE,
CRE_GM05::CIA_VLR_1 AS CIA_VLR_1,
CRE_GM05::CIA_DA_ARR_FIC AS CIA_DA_ARR_FIC,
CRE_GM05::CIA_TY_ENR AS CIA_TY_ENR,
CRE_GM05::CIA_CD_BTE AS CIA_CD_BTE,
CRE_GM05::CIA_CD_PER AS CIA_CD_PER,
CRE_GM05::CIA_CD_EFS AS CIA_CD_EFS,
CRE_GM05::CIA_CD_ETA_VAL_CRV AS CIA_CD_ETA_VAL_CRV,
CRE_GM05::CIA_CD_EVE_CPR AS CIA_CD_EVE_CPR,
CRE_GM05::CIA_CD_APLI_TDU AS CIA_CD_APLI_TDU,
CRE_GM05::CIA_CD_STE_RTG AS CIA_CD_STE_RTG,
CRE_GM05::CIA_DA_TT_RTG AS CIA_DA_TT_RTG,
CRE_GM05::CIA_NO_ENR_RTG AS CIA_NO_ENR_RTG,
CRE_GM05::CIA_DA_VAL_EVE AS CIA_DA_VAL_EVE,
CRE_GM05::T32_001 AS T32_001,
CRE_GM05::TEC_013 AS TEC_013,
CRE_GM05::TEC_014 AS TEC_014,
CRE_GM05::DAT_001_X AS DAT_001_X,
CRE_GM05::DAT_002_X AS DAT_002_X,
CRE_GM05::TEC_001 AS TEC_001;
STORE DATA_GM05 INTO '$OUTPUT_FILE' USING PigStorage(';');
It returns data, but I lose the first line containing the headers!
Note that my $input1 and $input2 variables are CSV files.
I tried using CSVLoader, but that doesn't work either.
I need the output stored with headers, please.

By default, Pig's final output has no header. Adding a header to the final output also doesn't make much sense, because the order of rows in Pig output is not fixed.
If you want a header in the final output, either merge all the part files into a single file on the local file system, where you can add the header explicitly, or store the output of this Pig script in a Hive table. HCatalog's HCatStorer can be used for the latter.
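For the first option, here is a minimal sketch in Python, assuming the part files have already been copied from HDFS to a local directory. The function name, the paths and the sample rows are all made up for illustration; only the part-* naming comes from how Pig writes its output.

```python
import tempfile
from pathlib import Path


def merge_with_header(part_dir, out_path, header):
    """Write header, then every part-* file from part_dir, into out_path."""
    # Sort so the part files are concatenated in a stable order.
    parts = sorted(Path(part_dir).glob("part-*"))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(header.rstrip("\n") + "\n")
        for part in parts:
            text = part.read_text(encoding="utf-8")
            out.write(text)
            # Guard against a part file that lacks a trailing newline.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(parts)


if __name__ == "__main__":
    # Demo on two made-up part files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "part-r-00000").write_text("1;A\n", encoding="utf-8")
        Path(tmp, "part-r-00001").write_text("2;B\n", encoding="utf-8")
        merged = Path(tmp, "merged.csv")
        merge_with_header(tmp, merged, "MGM_COMPTEUR;CIA_CD_CRV_CIA")
        print(merged.read_text(encoding="utf-8"))
```

The header string has to be supplied by hand (or read from the first line of the input CSV), since Pig itself carries no header through the job.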

Related

XPath for stackoverflow dump files

I am working with a file in the following format:
<badges>
<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>
<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>
</badges>
I am using the Pig XMLLoader to load this XML data:
A = LOAD '/badges' using org.apache.pig.piggybank.storage.XMLLoader('row') as (x:chararray);
B = foreach A generate xpath(x, '/row@Id');
Dump B;
The output I get is () - no values.
I want the output as text, i.e. 1,1,Teacher,2009-09-30T15:17:50.66. How can I do this?
I'm not familiar with the Pig XMLLoader, but /row@Id has two problems:
It's not valid XPath
If it were, it would be an absolute path
Try:
B = foreach A generate xpath(x, 'row/@Id');
It uses valid syntax and a relative path.
Use XPathAll to extract attributes; XPath has an issue when it comes to attributes.
REGISTER '/path/piggybank-0.15.0.jar'; -- Use the jar name you downloaded
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
B = foreach A generate XPathAll(x, 'row/@Id', true, false).$0 as (id:chararray);
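To make the target output concrete, here is a small sketch outside Pig, in plain Python with the standard library, that turns the row attributes into the comma-separated lines asked for. It is not what XMLLoader or XPathAll do internally; the function name and the fixed field order are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<badges>
<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>
<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>
</badges>"""

# Attribute order wanted in the text output.
FIELDS = ("Id", "UserId", "Name", "Date")


def rows_to_csv(xml_text, fields=FIELDS):
    """Return one comma-separated line per <row> element."""
    root = ET.fromstring(xml_text)
    lines = []
    # iter() also yields the root itself, so a lone <row .../> works too.
    for row in root.iter("row"):
        # row.get(name) reads one attribute, the counterpart of XPath's @name.
        lines.append(",".join(row.get(name, "") for name in fields))
    return lines


if __name__ == "__main__":
    for line in rows_to_csv(SAMPLE):
        print(line)
```

In Pig, the equivalent would be one XPathAll call per attribute (row/@Id, row/@UserId, and so on) inside the same FOREACH.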

Error while storing in HBase using Pig

Input data on HDFS:
[ituser1@genome-dev3 ~]$ hadoop fs -cat FOR_COPY/COMPETITOR_BROKERING/part-r-00000 | head -1
returns:
836646827,1000.0,2016-02-20,34,CAPITAL BOOK,POS/CAPITAL BOOK/NEW DELHI/200216/14:18,BOOKS AND STATIONERY,5497519004453567/41043516,MARRIED,M,SALARIED,D,5942,1
My Pig code:
DATA = LOAD 'FOR_COPY/COMPETITOR_BROKERING' USING PigStorage(',') AS (CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray,OCCURANCE:int);
DATA_FIL = FOREACH DATA GENERATE
(chararray)CUST_ID AS CUST_ID,
(chararray)TXN_AMT AS TXN_AMT,
(chararray)TXN_DATE AS TXN_DATE,
(chararray)AGE_CASA AS AGE_CASA,
(chararray)MERCH_NAME AS MERCH_NAME,
(chararray)TXN_PARTICULARS AS TXN_PARTICULARS,
(chararray)MCC_CATEGORY AS MCC_CATEGORY,
(chararray)TXN_REMARKS AS TXN_REMARKS,
(chararray)MARITAL_STATUS_CASA AS MARITAL_STATUS_CASA,
(chararray)GENDER_CASA AS GENDER_CASA,
(chararray)OCCUPATION_CAT_V2_NEW AS OCCUPATION_CAT_V2_NEW,
(chararray)DR_CR AS DR_CR,
(chararray)MCC_CODE AS MCC_CODE;
STORE DATA_FIL INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');
but it gives this error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job job_1457792710587_0100 failed, hadoop does not return any error message
My LOAD, however, works perfectly:
HDATA = LOAD 'hbase://TXN_EVENTS'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE','-loadKey true' )
AS (ROWKEY:chararray,CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray);
DUMP HDATA; -- this gives the expected result:
2016-03-01,1,20.0,2016-03-22,27,test_merch,test/particulars,test_category,test/remarks,married,M,service,D,1234
Any help is appreciated.
I am using the Hortonworks stack in distributed mode:
HDP2.3
Apache Pig version 0.15.0
HBase 1.1.1
Also, all JARs are in place, as I installed them through Ambari.
Solved the data upload: I had forgotten to RANK the relation. After RANK, the rank becomes the HBase row key.
DATA_FIL_1 = RANK DATA_FIL;
NOTE: this generates an arbitrary row key.
If you want to define your own row key, you have to build another relation whose first field is that key; the STORE function alone won't do it. HBaseStorage takes the first field of each tuple as the row key:
STORE DATA_FIL INTO 'hbase://genome:event_sink' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event_data:CUST_ID event_data:EVENT transaction_data:TXN_AMT transaction_data:TXN_DATE transaction_data:AGE_CASA transaction_data:MERCH_NAME transaction_data:TXN_PARTICULARS transaction_data:MCC_CATEGORY transaction_data:TXN_REMARKS transaction_data:MARITAL_STATUS_CASA transaction_data:GENDER_CASA transaction_data:OCCUPATION_CAT_V2_NEW transaction_data:DR_CR transaction_data:MCC_CODE');
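The row-key rule is easier to see on a toy model. The Python sketch below is not HBaseStorage's code, just an illustration of the mapping it applies: the first field becomes the row key and the remaining fields pair up with the column list, so a relation needs one more field than there are columns. The function name and the ValueError are assumptions made for the illustration.

```python
def to_hbase_cells(record, columns):
    """Toy model: first field is the row key, the rest pair up with columns."""
    row_key, values = record[0], record[1:]
    if len(values) != len(columns):
        # Illustrative check only; the real storer reports problems differently.
        raise ValueError(
            f"{len(columns)} columns need {len(columns) + 1} fields, "
            f"got {len(record)}"
        )
    return row_key, dict(zip(columns, values))


if __name__ == "__main__":
    columns = ["DETAILS:CUST_ID", "DETAILS:TXN_AMT"]
    # A ranked tuple: (rank, CUST_ID, TXN_AMT) -> the rank is the row key.
    print(to_hbase_cells((1, "836646827", "1000.0"), columns))
```

Seen this way, the original STORE passed 13 fields for 13 columns, leaving no separate field for the key, which is consistent with RANK (one extra leading field) making it work.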

Parse XML data with multiple attributes in Hadoop using Pig or Hive.

I want to parse XML of the type shown below in Hadoop, using Pig or Hive:
<PowerEvent sequence="00829" elapsedrealtime="0000047391" uptime="0000047391" timestamp="2016-01-17 00:31:36.750+0100" health="Good" level="69" plugged="NotPlugged" present="Present" status="NotCharging" temperature="23.0" voltage="3731" chargercurrent="25" batterycurrent="2209" coulombcounter="4294967292" screen="Off"/>
<ConnectivityEvent sequence="00830" elapsedrealtime="0000047471" uptime="0000047471" timestamp="2016-01-17 00:31:36.831+0100" connected ="true" available="true" activenetwork="WIFI" mobiledata="Off" cellular="Unknown" operatorid="22210" operatorname="vodafone IT"/>
I tried the script below:
register '/home/rajpsu03/pig/piggybank.jar';
xmldata = LOAD '/user/rajpsu03/pig/test.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Events') as(doc:chararray);
data = foreach xmldata
GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'[\\s*\\S*]*<PowerEvent[\\s*\\S*]*sequence="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*elapsedrealtime="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*uptime="(.*?)"[\\s*\\S*]*/>')) AS (sequence:chararray,elapsedrealtime:chararray,uptime:chararray);
dump data;

Spark: Avro RDD to CSV

I am able to read an Avro file into avroRDD and am trying to convert it into csvRDD, which should contain all the values, comma-separated. With the following code I can read a specific field into csvRDD.
val csvRDD = avroRDD.map { case (u, _) => u.datum.get("empname") }
How can I read all the values into csvRDD without specifying field names? My resulting csvRDD should contain records like the following:
(100,John,25,IN)
(101,Ricky,38,AUS)
(102,Chris,68,US)
Using Spark 1.2+ with the Spark-Avro integration library by Databricks, you can convert an Avro RDD to a CSV RDD as follows:
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // adds avroFile to SQLContext
val sqlContext = new SQLContext(sc)
val episodes = sqlContext.avroFile("episodes.avro")
val csv = episodes.map(_.mkString(","))
Running csv.collect().foreach(println) on this sample Avro file prints:
The Eleventh Hour,3 April 2010,11
The Doctor's Wife,14 May 2011,11
Horror of Fang Rock,3 September 1977,4
An Unearthly Child,23 November 1963,1
The Mysterious Planet,6 September 1986,6
Rose,26 March 2005,9
...
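One caveat with a plain mkString(","): a value that itself contains a comma is not quoted. The sketch below shows the same "join every field without naming it" idea in plain Python on in-memory records (no Spark or Avro involved), using the csv module so such values get quoted. The field names are made up to match the sample rows in the question.

```python
import csv
import io


def record_to_csv_line(record):
    """Join all values of a record, in field order, into one CSV line."""
    buf = io.StringIO()
    # csv.writer quotes values that contain the delimiter or quotes.
    csv.writer(buf).writerow(record.values())
    return buf.getvalue().rstrip("\r\n")


if __name__ == "__main__":
    records = [
        {"id": 100, "name": "John", "age": 25, "country": "IN"},
        {"id": 101, "name": "Ricky", "age": 38, "country": "AUS"},
    ]
    for rec in records:
        print(record_to_csv_line(rec))
```

If none of your values can contain commas or quotes, the one-line mkString version above is enough.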

How to use a file as the schema in Pig?

I have the following code. My ultimate goal is for the file it produces to serve as the schema for another input file. Is this possible?
%declare INPUT '$input'
%declare SCHEMA '$schema'
%declare OUTPUT '$output'
%declare DEL '$del'
%declare COL ':'
%declare COM ','
A = LOAD '$SCHEMA' using PigStorage('$DEL') AS (field:chararray, dataType:chararray, flag:chararray, chars:chararray);
B = FOREACH A GENERATE CONCAT(field,CONCAT('$COL',CONCAT(REPLACE(REPLACE(dataType, 'decimal','double'), 'string', 'chararray'),'$COM')));
rmf $OUTPUT
STORE B INTO '$OUTPUT';
I'm not sure this is the right approach.
Here is the output:
record_id:chararray,
offer_id:double,
decision_id:double,
offer_type_cd:integer,
promo_id:double,
pymt_method_type_cd:double,
cs_result_id:double,
cs_result_usage_type_cd:double,
rate_index_type_cd:double,
sub_product_id:double,
campaign_id:double,
market_cell_id:double,
assigned_offer_id:chararray,
accepted_offer_flag:chararray,
current_offer_flag:chararray,
offer_good_until_date:chararray,
Of course. You can use Pig's run command to run the script and do what you need. For a fuller explanation of how to do it, refer to this link.
Hope it helps!
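As a complement, here is a rough sketch of turning the generated file into a single schema string that a second script could receive through Pig's parameter substitution (for example as a hypothetical $schema parameter used inside AS (...)). The function name is made up, and the integer-to-int mapping is there because Pig spells its 32-bit integer type int, not integer.

```python
def build_pig_schema(lines):
    """Turn lines such as 'offer_id:double,' into one Pig schema string."""
    fields = []
    for line in lines:
        # Drop surrounding whitespace and the trailing comma of each line.
        entry = line.strip().rstrip(",")
        if not entry:
            continue
        name, _, dtype = entry.partition(":")
        # Pig has no "integer" type; the 32-bit integer type is "int".
        if dtype == "integer":
            dtype = "int"
        fields.append(f"{name}:{dtype}")
    return ", ".join(fields)


if __name__ == "__main__":
    sample = ["record_id:chararray,", "offer_id:double,", "offer_type_cd:integer,"]
    print(build_pig_schema(sample))
```

The same integer-to-int fix could also be made in the generating script itself by adding one more REPLACE around dataType.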
