MultiStorage in pig - hadoop

I have run the below pig script in the grunt shell
Register D:\Pig\contrib\piggybank\java\piggybank.jar;
a = load '/part' using PigStorage(',') as (uuid:chararray,timestamp:chararray,Name:chararray,EmailID:chararray,CompanyName:chararray,Location:chararray);
store a into '/output/multistorage' USING MultiStorage('/output/multistorage','2', 'none', ',');
while running this it throws error as shown below
2015-11-03 05:47:36,328 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 10
70: Could not resolve MultiStorage using imports: [, java.lang., org.apache.pig.
builtin., org.apache.pig.impl.builtin.]
Can any one help me in this?

You did not import your function as the log claims. If the jar is actually accessible for you, you can try the following code (There was one missing line):
REGISTER D:\Pig\contrib\piggybank\java\piggybank.jar;
DEFINE MULTISTORAGE org.apache.pig.piggybank.storage.MultiStorage();
a = LOAD'/part' USING PigStorage(',') AS (uuid:chararray,timestamp:chararray,Name:chararray,EmailID:chararray,CompanyName:chararray,Location:chararray);
STORE a into '/output/multistorage' USING MULTISTORAGE('/output/multistorage','2', 'none', ',');
You are then partitionnig by Name.

Related

Unable to Store a Pig Relation using Parquet Storer

I am trying the below Pig statements in grunt shell.
pig version is --> Apache Pig version 0.12.1
grunt> register /home/user/surender/mapreducejars/parquet-pig-1.0.1.jar;
grunt> A = LOAD '/user/user/inputfiles/parquet.txt' USING PigStorage(',') AS (id:int,name:chararray);
grunt> STORE A into '/user/user/outputfiles/pig' USING parquet.pig.ParquetStorer;
2016-09-27 07:09:18,509 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. parquet/io/ParquetEncodingException
Details at logfile: /home/user/surender/localinputfiles/pig_1474973730264.log
I want to know what went wrong here .Can someone help me on storing the pig relation using parquetStorage
you need to add the parquet jar like parquet-pig-bundle-1.5.0.jar and register it by
REGISTER '/path_for_jar/parquet-pig-bundle-1.5.0.jar';
please check the link which explains about it.
Here's a link!

Pig Script useHCatalog flag?

I have written simple pig script to read data from hive table.
A = LOAD 'default.movie' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP A;
It is working when i run through hue pig user interface. But it uses a flag useHCatalog.
When i run this using command line using same flag it is working
pig -useHCatalog sample.pig
But how can i run without this flag by providing required jar files and configuration in the pig script. I tried this. But doesn't work
REGISTER /usr/lib/hive/lib/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/storage-handlers/hbase/lib/*.jar
It throws an error when i run without flag
2015-12-15 05:05:55,379 [main] ERROR org.apache.pig.PigServer -
exception during parsing: Error during parsing. Table not found :
default.movie table not found Failed to parse: Can not retrieve schema
from loader org.apache.hive.hcatalog.pig.HCatLoader#25bdba7a
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1678)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1411)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:344)
at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
at org.apache.pig.PigServer.executeBatch(PigServer.java:355)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
I just want to know, what is behind useHCatalog flag. what i have to register in order to work fine?
You have to pass the hive configuration as well, namely the file hive-site.xml which will point pig to the metastore. Otherwise pig does not know where to look for the table information.
This page might be helpful: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore

Loading Multiple Files in PIG

I have 35 Csv files I want to load the data using Pig. I have tried it with the following attempts
1) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' UsingPigStorage(',');
For this attempt I have got the error
014-10-06 00:32:07,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Can not create a Path from an empty string
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580582549.log
In the next attempt I have changed script with using SomeLoader();
2) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' using SomeLoader();
But I got the error saying this
2014-10-06 00:39:42,905 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve SomeLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580912789.log
Pig will always load all files in a directory. So you just need to specify the directory with your CSV files.
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/' using PigStorage(',');
Please also note usingPigStorage() is missing a whitespace. It should be using PigStorage().
And you have some double commas: ...HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}...
Pig supports providing file names as regular expressions. So you can provide something like:
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA*' Using PigStorage(',');
and it will load all files with names starting from 'HLPCA' in Household directory.

Unable to resolve Over() in Apache Pig

I'm getting the following error when using Over() in Pig:
Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve Over using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
The error occurs upon execution of the closing brace for C:
A = load 'data/watch*.txt' as (id,ts,watch);
B= GROUP A BY id;
C= FOREACH B {
C1 = ORDER A BY ts;
GENERATE FLATTEN(Stitch(C1,Over(C1.watch,'lag',-1,0)));
}
It seems to me that Over() is not included in my Pig but I'm not sure why because I believe my versions of pig and hadoop should be sufficiently up to date.
$ pig -version
Apache Pig version 0.12.1-SNAPSHOT (rexported)
compiled Feb 19 2014, 16:31:42
$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar
Any insight would be much appreciated. I'm wondering at this point if I should just use an Over() UDF from PiggyBank.
I believe there is no OVER function in the built-ins for Pig v12. You need to use the OVER function in piggybank.
REGISTER piggybank.jar
DEFINE Over org.apache.pig.piggybank.evaluation.Over();
A = load 'data/watch*.txt' as (id,ts,watch);
B= GROUP A BY id;
C= FOREACH B {
C1 = ORDER A BY ts;
GENERATE FLATTEN(Stitch(C1,Over(C1.watch,'lag',-1,0)));
}

Loading Hbase table with Pig. Float gives FIELD_DISCARDED_TYPE_CONVERSION_FAILED

I've got a HBase table that is loaded via the HBase Java api like so:
put.add(Bytes.toBytes(HBaseConnection.FAMILY_NAME), Bytes.toBytes("value"), Bytes.toBytes(value));
(Where the variable value is a normal java float.)
I proceed to load this with Pig as follows:
raw = LOAD 'hbase://tableName' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:value', '-loadKey true -limit 5') AS (id:chararray, value:float);
However when I dump this with:
dump raw;
I get:
[main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s).
for each float value. The ID's are printed fine.
Im running:
Apache Hadoop 0.20.2.05
Pig 0.9.2
Hbase 0.92.0
My question: Why cant pig handle theses float values? What am I doing wrong?
Turns out you have to add a caster. Like so:
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:value', '-loadKey true -limit 5 -caster HBaseBinaryConverter')
Please try by following way:
test = load '/user/training/user' using PigStorage(',')
as (user_id, name, age:int, country, gender);
As default delimiter for loading is tab.

Resources