Unable to Store a Pig Relation using Parquet Storer - hadoop

I am trying the Pig statements below in the Grunt shell.
The Pig version is Apache Pig 0.12.1.
grunt> register /home/user/surender/mapreducejars/parquet-pig-1.0.1.jar;
grunt> A = LOAD '/user/user/inputfiles/parquet.txt' USING PigStorage(',') AS (id:int,name:chararray);
grunt> STORE A into '/user/user/outputfiles/pig' USING parquet.pig.ParquetStorer;
2016-09-27 07:09:18,509 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. parquet/io/ParquetEncodingException
Details at logfile: /home/user/surender/localinputfiles/pig_1474973730264.log
I want to know what went wrong here. Can someone help me store the Pig relation using ParquetStorer?

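The error names a class, parquet/io/ParquetEncodingException, that could not be loaded, which suggests that parquet-pig-1.0.1.jar on its own does not carry everything ParquetStorer depends on. You need the Parquet bundle jar, for example parquet-pig-bundle-1.5.0.jar, and register it with:
REGISTER '/path_for_jar/parquet-pig-bundle-1.5.0.jar';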

Related

How to suppress info message "io.bytes.per.checksum is deprecated" in grunt shell

For big data analysis I'm running Apache Pig version 0.17.0 on top of Hadoop 2.7.2. Every time I run a load command in the Grunt shell in local mode, I get the following message:
grunt> A = load '/usr/lib/pig/data.txt' using TextLoader as (date:chararray);
[main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Is there a way to switch off this message? It becomes very annoying with frequent use of the Grunt shell.
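Check whether the following works for you. Create a file named nolog.conf with the following content:
log4j.rootLogger=fatal
Then run Pig as follows:
pig -x local -4 nolog.conf
Note that this raises the log threshold for every logger, so other INFO and WARN messages are hidden as well.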

ERROR 1066: Unable to open iterator for alias in Pig 0.14

I have Hadoop 2.7.1 and Pig 0.14.0 installed on a CentOS PC. I tried running a Pig dump command in the Grunt shell, but it failed with the error below:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias
Below are the commands I am running:
grunt> youtube_dump = LOAD '/youtubedata.txt' as (video_id:chararray, uploader:chararray, upload_interval:int, category:chararray, video_length:int, views:int, rating:float, no_rating:int, no_comments:int, related_video_ids:chararray);
grunt> grp_by_category = GROUP youtube_dump by category;
grunt> dump grp_by_category;
If I run Pig in local mode with youtubedata.txt copied to the local disk, I don't get any error.
I was finally able to resolve the issue. After some research I found that my yarn-site.xml was incorrectly configured: there was a typo in a property name. Instead of "yarn.nodemanager.aux-services" it read "yarn.nodemanager.aux-service" (missing the final "s" in "services").
After correcting yarn-site.xml, I was able to run Pig commands without any error.
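A typo like this is easy to miss because Hadoop silently ignores property names it does not recognise. The following is a minimal sketch, not part of the original answer, of how such near-miss names could be spotted. The function name find_near_misses, the similarity cutoff, and the sample value mapreduce_shuffle are assumptions made for illustration.

```python
import difflib
import xml.etree.ElementTree as ET


def find_near_misses(xml_text, expected_names, cutoff=0.9):
    """Return {configured_name: expected_name} for names that look like typos.

    A configured name is reported when it is not in expected_names but is
    very similar to one of them (similarity ratio >= cutoff).
    """
    root = ET.fromstring(xml_text)
    suspects = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if not name or name in expected_names:
            continue
        close = difflib.get_close_matches(name, expected_names, n=1, cutoff=cutoff)
        if close:
            suspects[name] = close[0]
    return suspects


# Sample file content reproducing the typo described above.
# The value "mapreduce_shuffle" is an assumed, typical setting.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-service</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for wrong, right in find_near_misses(SAMPLE, ["yarn.nodemanager.aux-services"]).items():
        print(f"suspicious property name: {wrong!r}; did you mean {right!r}?")
```

To check a real file, read its text and pass it in place of SAMPLE, along with the list of property names you expect to be set.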

Pig Script useHCatalog flag?

I have written a simple Pig script to read data from a Hive table.
A = LOAD 'default.movie' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP A;
It works when I run it through the Hue Pig user interface, but Hue passes the useHCatalog flag.
It also works when I run it from the command line with the same flag:
pig -useHCatalog sample.pig
But how can I run it without this flag, by providing the required jar files and configuration in the Pig script itself? I tried the following, but it doesn't work:
REGISTER /usr/lib/hive/lib/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/storage-handlers/hbase/lib/*.jar
It throws an error when I run it without the flag:
2015-12-15 05:05:55,379 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Table not found : default.movie table not found
Failed to parse: Can not retrieve schema from loader org.apache.hive.hcatalog.pig.HCatLoader@25bdba7a
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1678)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1411)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:344)
at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
at org.apache.pig.PigServer.executeBatch(PigServer.java:355)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
I just want to know what the useHCatalog flag does behind the scenes. What do I have to register for the script to work?
You also have to pass the Hive configuration, namely the file hive-site.xml, which points Pig to the metastore. Otherwise Pig does not know where to look for the table information.
This page might be helpful: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore

Loading Multiple Files in PIG

I have 35 CSV files and want to load their data using Pig. I have made the following attempts:
1) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' UsingPigStorage(',');
For this attempt I got the error:
2014-10-06 00:32:07,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Can not create a Path from an empty string
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580582549.log
In the next attempt I changed the script to use SomeLoader():
2) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' using SomeLoader();
But I got this error:
2014-10-06 00:39:42,905 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve SomeLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580912789.log
Pig loads every file in a directory, so you only need to specify the directory that holds your CSV files.
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/' using PigStorage(',');
Also note that UsingPigStorage(',') in your first attempt is missing a space. It should be USING PigStorage(',').
And your brace list contains some double commas: ...HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}...
Pig supports file names given as glob patterns (the same wildcards Hadoop uses, not full regular expressions). So you can provide something like:
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA*' Using PigStorage(',');
and it will load all files in the Household directory whose names start with 'HLPCA'.
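The wildcard in the LOAD path is matched as a glob, not as a regular expression, and the two read the same pattern differently. The sketch below uses Python's fnmatch module as a stand-in for Hadoop's glob matching. That is an approximation for the simple * case only, and the file names are made up for illustration.

```python
import fnmatch
import re

# Hypothetical directory listing, for illustration only.
names = ["HLPCA-00000.csv", "HLPCA-01000.csv", "notes.txt"]


def glob_matches(pattern, candidates):
    """Names matching a shell-style glob, where * means any run of characters."""
    return [n for n in candidates if fnmatch.fnmatchcase(n, pattern)]


def regex_matches(pattern, candidates):
    """Names matching a regular expression in full."""
    return [n for n in candidates if re.fullmatch(pattern, n)]


if __name__ == "__main__":
    # As a glob, HLPCA* means "HLPCA followed by anything".
    print(glob_matches("HLPCA*", names))
    # As a regex, HLPCA* means "HLPC followed by zero or more A", so nothing matches.
    print(regex_matches("HLPCA*", names))
    # The regex equivalent of the glob is HLPCA.*
    print(regex_matches("HLPCA.*", names))
```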

Loading Hbase table with Pig. Float gives FIELD_DISCARDED_TYPE_CONVERSION_FAILED

I have an HBase table that is loaded via the HBase Java API like so:
put.add(Bytes.toBytes(HBaseConnection.FAMILY_NAME), Bytes.toBytes("value"), Bytes.toBytes(value));
(Where the variable value is a normal java float.)
I proceed to load this with Pig as follows:
raw = LOAD 'hbase://tableName' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:value', '-loadKey true -limit 5') AS (id:chararray, value:float);
However when I dump this with:
dump raw;
I get:
[main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 time(s).
once for each float value. The IDs are printed fine.
I'm running:
Apache Hadoop 0.20.2.05
Pig 0.9.2
Hbase 0.92.0
My question: why can't Pig handle these float values? What am I doing wrong?
It turns out you have to specify a caster, like so:
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('family:value', '-loadKey true -limit 5 -caster HBaseBinaryConverter')
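The likely reason the caster matters: Bytes.toBytes(float) writes the value as 4 raw bytes (big-endian IEEE 754), while HBaseStorage's default caster interprets cell contents as text. The following is a rough Python sketch of that mismatch, not Pig or HBase code. The function names are made up for illustration, and returning None stands in for Pig discarding the field.

```python
import struct


def encode_like_bytes_tobytes(value):
    """Encode a float as 4 big-endian IEEE 754 bytes, as Bytes.toBytes(float) does."""
    return struct.pack(">f", value)


def read_as_text(raw):
    """Mimic a text-based caster: decode the bytes as UTF-8, then parse a number.

    Returns None when the bytes are not a textual number, which stands in
    for the field being discarded.
    """
    try:
        return float(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None


def read_as_binary(raw):
    """Mimic a binary caster: reinterpret the 4 bytes as a float."""
    return struct.unpack(">f", raw)[0]


if __name__ == "__main__":
    raw = encode_like_bytes_tobytes(1.5)
    print(raw.hex())              # the stored cell bytes
    print(read_as_text(raw))      # text interpretation fails
    print(read_as_binary(raw))    # binary interpretation recovers the value
    print(read_as_text(b"1.5"))   # a value stored as text parses fine
```

The other way to make the two sides agree would be to store the value as text on the Java side, so that the default caster can parse it.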
Please try it the following way:
test = load '/user/training/user' using PigStorage(',')
as (user_id, name, age:int, country, gender);
The default delimiter for loading is tab, so a comma-separated file needs the delimiter passed explicitly.
