I am trying the Pig statements below in the Grunt shell.
Pig version: Apache Pig 0.12.1
grunt> register /home/user/surender/mapreducejars/parquet-pig-1.0.1.jar;
grunt> A = LOAD '/user/user/inputfiles/parquet.txt' USING PigStorage(',') AS (id:int,name:chararray);
grunt> STORE A into '/user/user/outputfiles/pig' USING parquet.pig.ParquetStorer;
2016-09-27 07:09:18,509 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. parquet/io/ParquetEncodingException
Details at logfile: /home/user/surender/localinputfiles/pig_1474973730264.log
I want to know what went wrong here. Can someone help me with storing a Pig relation using ParquetStorer?
You need to add the Parquet bundle jar, e.g. parquet-pig-bundle-1.5.0.jar, and register it with:
REGISTER '/path_for_jar/parquet-pig-bundle-1.5.0.jar';
Please see the linked documentation, which explains this in more detail.
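For reference, the whole flow would then look roughly like this (a sketch only; the jar path is an assumption and should point to wherever the bundle jar actually lives):
-- the bundle jar ships Parquet's dependencies, unlike the bare parquet-pig jar
REGISTER '/home/user/surender/mapreducejars/parquet-pig-bundle-1.5.0.jar';
A = LOAD '/user/user/inputfiles/parquet.txt' USING PigStorage(',') AS (id:int, name:chararray);
STORE A INTO '/user/user/outputfiles/pig' USING parquet.pig.ParquetStorer;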
I have written a simple Pig script to read data from a Hive table.
A = LOAD 'default.movie' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP A;
It works when I run it through the Hue Pig user interface, but that uses the useHCatalog flag.
When I run it from the command line with the same flag, it also works:
pig -useHCatalog sample.pig
But how can I run it without this flag, by providing the required jar files and configuration in the Pig script itself? I tried this, but it doesn't work:
REGISTER /usr/lib/hive/lib/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/storage-handlers/hbase/lib/*.jar
It throws an error when I run it without the flag:
2015-12-15 05:05:55,379 [main] ERROR org.apache.pig.PigServer -
exception during parsing: Error during parsing. Table not found :
default.movie table not found Failed to parse: Can not retrieve schema
from loader org.apache.hive.hcatalog.pig.HCatLoader#25bdba7a
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1678)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1411)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:344)
at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
at org.apache.pig.PigServer.executeBatch(PigServer.java:355)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
I just want to know what is behind the useHCatalog flag. What do I have to register to make this work?
You have to pass the Hive configuration as well, namely the file hive-site.xml, which points Pig to the metastore. Otherwise Pig does not know where to look for the table information.
This page might be helpful: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore
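As a rough sketch of what the flag does for you, you can put the Hive/HCatalog jars and the configuration directory holding hive-site.xml on Pig's classpath yourself before launching; the paths below are assumptions based on a typical packaged install and need adjusting to your environment:
# Hive and HCatalog jars, plus the conf directory that holds hive-site.xml
export HIVE_HOME=/usr/lib/hive
export HCAT_HOME=/usr/lib/hive-hcatalog
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/*:$HIVE_HOME/lib/*:$HIVE_HOME/conf

# with hive-site.xml on the classpath, HCatLoader can reach the metastore
pig sample.pig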
I have 35 CSV files whose data I want to load using Pig. I have made the following attempts.
1) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' UsingPigStorage(',');
For this attempt I got the following error:
2014-10-06 00:32:07,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Can not create a Path from an empty string
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580582549.log
In the next attempt I changed the script to use SomeLoader():
2) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' using SomeLoader();
But I got this error:
2014-10-06 00:39:42,905 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve SomeLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580912789.log
Pig will always load all files in a directory, so you just need to specify the directory containing your CSV files:
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/' using PigStorage(',');
Please also note that UsingPigStorage(',') is missing a space; it should be USING PigStorage(',').
You also have some double commas in your file list: ...HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}...
Pig also supports glob patterns in file names, so you can provide something like:
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA*' USING PigStorage(',');
and it will load all files whose names start with 'HLPCA' in the Household directory.
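If the directory also holds files you don't want to pick up, a slightly narrower glob built from the file names in the question works too (a sketch):
-- matches only the HLPCA-NNNNN.csv files
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA-*.csv' USING PigStorage(',');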
I'm getting the following error when using Over() in Pig:
Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve Over using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
The error occurs upon execution of the closing brace for C:
A = load 'data/watch*.txt' as (id,ts,watch);
B= GROUP A BY id;
C= FOREACH B {
C1 = ORDER A BY ts;
GENERATE FLATTEN(Stitch(C1,Over(C1.watch,'lag',-1,0)));
}
It seems to me that Over() is not included in my Pig build, but I'm not sure why, because I believe my versions of Pig and Hadoop should be sufficiently up to date.
$ pig -version
Apache Pig version 0.12.1-SNAPSHOT (rexported)
compiled Feb 19 2014, 16:31:42
$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar
Any insight would be much appreciated. I'm wondering at this point if I should just use an Over() UDF from PiggyBank.
I believe there is no Over function among the built-ins in Pig 0.12; you need to use the Over UDF from piggybank:
REGISTER piggybank.jar;
DEFINE Over org.apache.pig.piggybank.evaluation.Over();

A = LOAD 'data/watch*.txt' AS (id, ts, watch);
B = GROUP A BY id;
C = FOREACH B {
    C1 = ORDER A BY ts;
    GENERATE FLATTEN(Stitch(C1, Over(C1.watch, 'lag', -1, 0)));
};
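Depending on your install, you may need the full path to the jar when registering it, and note that Stitch also lives in piggybank, so if it fails to resolve as well, a similar DEFINE should help (both the path and the Stitch class name here are assumptions to verify against your distribution):
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch();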
I've got an HBase table that is loaded via the HBase Java API like so:
put.add(Bytes.toBytes(HBaseConnection.FAMILY_NAME), Bytes.toBytes("value"), Bytes.toBytes(value));
(Where the variable value is a normal Java float.)
I proceed to load this with Pig as follows:
raw = LOAD 'hbase://tableName' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:value', '-loadKey true -limit 5') AS (id:chararray, value:float);
However when I dump this with:
dump raw;
I get:
[main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s).
for each float value. The IDs are printed fine.
I'm running:
Apache Hadoop 0.20.2.05
Pig 0.9.2
Hbase 0.92.0
My question: why can't Pig handle these float values? What am I doing wrong?
It turns out you have to add a caster, like so:
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:value', '-loadKey true -limit 5 -caster HBaseBinaryConverter')
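Putting that together with the LOAD statement from the question, the full line becomes:
raw = LOAD 'hbase://tableName' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:value', '-loadKey true -limit 5 -caster HBaseBinaryConverter') AS (id:chararray, value:float);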
Please try it the following way:
test = load '/user/training/user' using PigStorage(',')
as (user_id, name, age:int, country, gender);
The default delimiter for loading is tab, so you need to specify ',' explicitly for comma-separated data.
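In other words, PigStorage() with no argument splits on tabs, while comma-separated data needs the delimiter passed in; a minimal sketch with hypothetical aliases:
-- default: splits each line on tab characters
test_tab = LOAD '/user/training/user' USING PigStorage() AS (user_id, name, age:int, country, gender);
-- explicit comma delimiter for comma-separated data
test_csv = LOAD '/user/training/user' USING PigStorage(',') AS (user_id, name, age:int, country, gender);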