How to have a file as schema for pig? - hadoop

My ultimate goal is to have the file produced by this script serve as the schema for another input file. Is this possible? I have the following code:
%declare INPUT '$input'
%declare SCHEMA '$schema'
%declare OUTPUT '$output'
%declare DEL '$del'
%declare COL ':'
%declare COM ','
A = LOAD '$SCHEMA' using PigStorage('$DEL') AS (field:chararray, dataType:chararray, flag:chararray, chars:chararray);
B = FOREACH A GENERATE CONCAT(field,CONCAT('$COL',CONCAT(REPLACE(REPLACE(dataType, 'decimal','double'), 'string', 'chararray'),'$COM')));
rmf $OUTPUT
STORE B INTO '$OUTPUT';
I'm not sure this is the right approach.
Here is the output:
record_id:chararray,
offer_id:double,
decision_id:double,
offer_type_cd:integer,
promo_id:double,
pymt_method_type_cd:double,
cs_result_id:double,
cs_result_usage_type_cd:double,
rate_index_type_cd:double,
sub_product_id:double,
campaign_id:double,
market_cell_id:double,
assigned_offer_id:chararray,
accepted_offer_flag:chararray,
current_offer_flag:chararray,
offer_good_until_date:chararray,

Yes, you can use Pig's run command to run the script and do whatever you need with its result. For more explanation of how to do it, refer to this link.
Hope it helps!
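One way to make the generated file usable as a schema is to collapse its lines into a single schema string. The following is only a minimal sketch in Python, assuming the output lines look like the ones listed in the question; the function name `build_schema` is hypothetical, and the `integer` to `int` mapping is added here because Pig's type name is `int`.

```python
def build_schema(lines):
    """Join 'name:type,' lines into one Pig-style schema string."""
    fields = []
    for line in lines:
        entry = line.strip().rstrip(",")
        if not entry:
            continue  # skip blank lines
        name, _, dtype = entry.partition(":")
        # Assumption: map 'integer' to Pig's actual type name 'int'.
        if dtype == "integer":
            dtype = "int"
        fields.append(f"{name}:{dtype}")
    return ", ".join(fields)


sample = ["record_id:chararray,", "offer_id:double,", "offer_type_cd:integer,"]
schema = build_schema(sample)
print(schema)
```

The resulting string could then be handed to a second script as a parameter (for example via `pig -param`) and referenced in the `AS (...)` clause of its LOAD statement. That wiring is an assumption here, not something the answer above spells out.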

Related

Why header is automatically skipped in output file

I want to store my data without skipping the header line.
This is my Pig script:
CRE_GM05 = LOAD '$input1' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,T32_001:chararray,TEC_013:chararray,TEC_014:chararray,DAT_001_X:chararray,DAT_002_X:chararray,TEC_001:chararray);
CRE_GM11 = LOAD '$input2' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,DAT_001_X:chararray,DAT_002_X:chararray,D08_001:chararray,PSE_001:chararray,PSE_002:chararray,PSE_003:chararray,RUB_001:chararray,RUB_002:chararray,RUB_003:chararray,RUB_004:chararray,RUB_005:chararray,RUB_006:chararray,RUB_007:chararray,RUB_008:chararray,RUB_009:chararray,RUB_010:chararray,TEC_001:chararray,TEC_002:chararray,TEC_003:chararray,TX_001_VLR:chararray,TX_001_DCM:chararray,D08_004:chararray,D11_004:chararray,RUB_016:chararray,T03_001:chararray);
-- Join the two tables
JOINED_TABLES = JOIN CRE_GM05 BY TEC_001, CRE_GM11 BY TEC_001;
-- Generate the columns
DATA_GM05 = FOREACH JOINED_TABLES GENERATE
CRE_GM05::MGM_COMPTEUR AS MGM_COMPTEUR,
CRE_GM05::CIA_CD_CRV_CIA AS CIA_CD_CRV_CIA,
CRE_GM05::CIA_DA_EM_CRV AS CIA_DA_EM_CRV,
CRE_GM05::CIA_CD_CTRL_BLCE AS CIA_CD_CTRL_BLCE,
CRE_GM05::CIA_IDC_EXTR_RDJ AS CIA_IDC_EXTR_RDJ,
CRE_GM05::CIA_VLR_IDT_CRV_LOQ AS CIA_VLR_IDT_CRV_LOQ,
CRE_GM05::CIA_VLR_REF_CRV AS CIA_VLR_REF_CRV,
CRE_GM05::CIA_VLR_LG_ZON_RTG AS CIA_VLR_LG_ZON_RTG,
CRE_GM05::CIA_HEU_CIA AS CIA_HEU_CIA,
CRE_GM05::CIA_TM_STP_CRE AS CIA_TM_STP_CRE,
CRE_GM05::CIA_VLR_1 AS CIA_VLR_1,
CRE_GM05::CIA_DA_ARR_FIC AS CIA_DA_ARR_FIC,
CRE_GM05::CIA_TY_ENR AS CIA_TY_ENR,
CRE_GM05::CIA_CD_BTE AS CIA_CD_BTE,
CRE_GM05::CIA_CD_PER AS CIA_CD_PER,
CRE_GM05::CIA_CD_EFS AS CIA_CD_EFS,
CRE_GM05::CIA_CD_ETA_VAL_CRV AS CIA_CD_ETA_VAL_CRV,
CRE_GM05::CIA_CD_EVE_CPR AS CIA_CD_EVE_CPR,
CRE_GM05::CIA_CD_APLI_TDU AS CIA_CD_APLI_TDU,
CRE_GM05::CIA_CD_STE_RTG AS CIA_CD_STE_RTG,
CRE_GM05::CIA_DA_TT_RTG AS CIA_DA_TT_RTG,
CRE_GM05::CIA_NO_ENR_RTG AS CIA_NO_ENR_RTG,
CRE_GM05::CIA_DA_VAL_EVE AS CIA_DA_VAL_EVE,
CRE_GM05::T32_001 AS T32_001,
CRE_GM05::TEC_013 AS TEC_013,
CRE_GM05::TEC_014 AS TEC_014,
CRE_GM05::DAT_001_X AS DAT_001_X,
CRE_GM05::DAT_002_X AS DAT_002_X,
CRE_GM05::TEC_001 AS TEC_001;
STORE DATA_GM05 INTO '$OUTPUT_FILE' USING PigStorage(';');
It returns data, but I lose the first line with the headers!
Note that my $input1 and $input2 variables are CSV files.
I tried using CSVLoader, but that doesn't work either.
I need the output to be stored with headers, please.
Pig does not write headers to its final output by default. Adding a header to the final output also doesn't make much sense, because the order of rows in Pig output is not fixed.
If you want a header in the final output, either merge all the part files into a single file on the local file system, where you can add the header explicitly, or store the output of this Pig script in a Hive table. The HCatalog storer (HCatStorer) can be used for that.
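The "merge the part files and add the header yourself" option can be sketched as follows. This is a minimal illustration in Python, assuming the part files have already been copied to a local directory and are named `part-*`; the function name `merge_with_header` and its parameters are hypothetical.

```python
import glob
import os


def merge_with_header(part_dir, out_path, columns, sep=";"):
    """Write one header line, then append every part-* file in name order."""
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(sep.join(columns) + "\n")
        for part in parts:
            with open(part, encoding="utf-8") as src:
                for line in src:
                    # Make sure every record ends with a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
```

Called as `merge_with_header("output_dir", "result.csv", ["MGM_COMPTEUR", "CIA_CD_CRV_CIA"])`, it would produce a single file whose first line is the header. The directory and file names in that call are placeholders.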

XPath for stackoverflow dump files

I am working with a file in the following format:
<badges>
<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>
<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>
</badges>
I am using the Pig XMLLoader to load this XML data:
A = LOAD '/badges' using org.apache.pig.piggybank.storage.XMLLoader('row') as (x:chararray);
B = foreach A generate xpath(x, '/row@Id');
Dump B;
The output I get is () - no values.
I want the output as text, i.e. 1,1,Teacher,2009-09-30T15:17:50.66. How can I do this?
I'm not familiar with the Pig XMLLoader, but /row@Id has two problems:
It's not valid XPath
If it were, it would be an absolute path
Try:
B = foreach A generate xpath(x, 'row/@Id');
It uses valid syntax and a relative path.
Use XPathAll to extract attributes. XPath has an issue when it comes to attributes.
REGISTER '/path/piggybank-0.15.0.jar'; -- Use the jar name you downloaded
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
B = foreach A generate XPathAll(x, 'row/@Id', true, false).$0 as (id:chararray);
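To check what the desired text output should look like outside of Pig, here is a small sketch using Python's standard library. It is not the Pig solution itself: ElementTree's limited XPath subset cannot return attribute nodes, so the attributes are read with `.get()` instead. The function name `rows_as_csv` is hypothetical.

```python
import xml.etree.ElementTree as ET

xml_text = """<badges>
<row Id="1" UserId="1" Name="Teacher" Date="2009-09-30T15:17:50.66"/>
<row Id="2" UserId="3" Name="Teacher" Date="2009-09-30T15:17:50.69"/>
</badges>"""


def rows_as_csv(text, attrs=("Id", "UserId", "Name", "Date")):
    """Return one comma-separated line per <row>, built from its attributes."""
    root = ET.fromstring(text)
    return [",".join(row.get(a, "") for a in attrs) for row in root.findall("row")]


lines = rows_as_csv(xml_text)
print("\n".join(lines))
```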

How to process a multi-delimiter file in Pig 0.8

I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written Pig code as follows:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work and gives the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following link but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me resolve this error.
Thanks.
Solution for your input example:
LOAD the file as comma-separated, then STRSPLIT the last field by ';' and FLATTEN the result.
I finally found a solution.
Here it is:
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);
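The same two-delimiter split can be illustrated outside of Pig. This is a minimal sketch in Python, assuming every record has exactly the shape shown in the question; the function name `split_record` is hypothetical.

```python
import re


def split_record(line):
    """Split on either ',' or ';' and convert the numeric fields."""
    empid, ename, sal, deptno = re.split(r"[,;]", line.strip())
    return int(empid), ename, int(sal), int(deptno)


records = [split_record(l) for l in ["1,Mical,2000;10", "2,Smith,3000;20"]]
print(records)
```

Putting both delimiters in one character class avoids the need for a separate pattern per delimiter, which is the part the original regular expression struggled with.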

Ruby - CSV works while SmarterCSV doesn't

I want to open a CSV file using SmarterCSV.process:
market_csv = SmarterCSV.process(market)
p "just read #{market_csv}"
The problem is that the data is not read and this prints:
[]
However, if I attempt the same thing with the default CSV library, the content of the file is read (the following statement prints the file):
CSV.foreach(market) do |row|
p row
end
The content of the file I was reading is of the form:
Date,Close
03/06/15,0.1634
02/06/15,0.1637
01/06/15,0.1638
31/05/15,0.1638
The problem could come from the line separator. Line endings differ between systems ("\r\n" on Windows, "\n" on Unix, and a bare "\r" in some files, such as classic Mac OS exports). Try to identify the separator your file uses and pass it to SmarterCSV.process like this:
market_csv = SmarterCSV.process(market, row_sep: "\r")
p "just read #{market_csv}"
or let SmarterCSV detect it automatically, like this:
market_csv = SmarterCSV.process(market, row_sep: :auto)
p "just read #{market_csv}"
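Identifying the line separator can be done by looking at the raw bytes of the file. A minimal sketch in Python, assuming the file is small enough to read fully; the function name `detect_row_sep` is hypothetical.

```python
def detect_row_sep(data: bytes):
    """Return the line separator used in data, or None if there is none."""
    if b"\r\n" in data:
        return "\r\n"  # Windows
    if b"\n" in data:
        return "\n"    # Unix
    if b"\r" in data:
        return "\r"    # bare carriage return
    return None


sample = b"Date,Close\r03/06/15,0.1634\r02/06/15,0.1637\r"
print(repr(detect_row_sep(sample)))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`, and the detected value is what you would pass as `row_sep`.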

Encoding German characters

I need to import some Perl-generated files into an Oracle database using LOAD DATA.
The Perl script fetches a web page and writes a CSV file.
Here is a simplified script:
use File::Slurp;
my $c= ( $user && $passwd )
? get("$protocol://$user:$passwd\@$url")
: get("$protocol://$url");
write_file("$VZ_GET/$FileTS.$typ.csv",$c);
Here is a sample line from the web page:
5052;97;Jan;Ihrfelt 5053;97;Jari;Honko 5121;97;Katja;Keitaanniemi 5302;97;Ola;Södermark 5421;97;Sven;Sköld 5609;97;Peter;Näslund
The content of the web page is saved in the variable $c.
Here is a sample line of the CSV file:
5053;97;Jari;Honko
Here is the load command:
LOAD DATA
INTO TABLE LIQA
TRUNCATE
FIELDS TERMINATED BY ";"
(
LIQA_ANALYST_ID,
LIQA_FIRM_ID,
LIQA_ANALYST_FIRST_NAME,
LIQA_ANALYST_LAST_NAME,
LIQA_TS_INSERT DATE 'YYYYMMDDHH24MISS'
)
The query SELECT * FROM NLS_DATABASE_PARAMETERS WHERE PARAMETER = 'NLS_CHARACTERSET'; returns AL32UTF8.
The generated CSV file is recognized as UTF-8 Unicode text.
Still, I can't import German characters. They are correct in the CSV file, but not in the database.
I have also tried to convert $c like this:
$c = encode("iso-8859-1", $c);
The generated CSV file is still recognized as UTF-8 Unicode text.
I have no clue how to fix it.
I have solved it:
$c = decode( 'utf-8', $c );
$c = encode( 'iso-8859-1' , $c );
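A likely explanation for why the extra decode step matters: the downloaded content is a string of UTF-8 bytes, and encoding it to ISO-8859-1 without decoding first treats each byte as its own character, so the bytes do not change. The sketch below reproduces that effect in Python as an illustration only; it is not the Perl code, and the variable names are made up for the example.

```python
# UTF-8 bytes as they would arrive from the web page.
raw = "Södermark".encode("utf-8")

# Encoding without a real decode: each byte is treated as one Latin-1
# character, so the result is byte-for-byte identical (still UTF-8).
unchanged = raw.decode("iso-8859-1").encode("iso-8859-1")

# Decode as UTF-8 first, then encode: 'ö' becomes the single byte 0xF6.
latin1 = raw.decode("utf-8").encode("iso-8859-1")

print(len(raw), len(unchanged), len(latin1))
```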
