Loading Multiple Files in PIG - hadoop

I have 35 CSV files whose data I want to load using Pig. I have made the following attempts:
1) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' UsingPigStorage(',');
For this attempt I got the error
2014-10-06 00:32:07,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Can not create a Path from an empty string
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580582549.log
In the next attempt I changed the script to use using SomeLoader();
2) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' using SomeLoader();
But I got this error:
2014-10-06 00:39:42,905 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve SomeLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580912789.log

Pig will always load all files in a directory. So you just need to specify the directory with your CSV files.
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/' using PigStorage(',');
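Please also note that UsingPigStorage(',') in your first attempt is missing a space. It should be using PigStorage(',').
You also have some double commas: ...HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}... Each one adds an empty entry to the brace list, which is the likely source of the "Can not create a Path from an empty string" error.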

Pig supports wildcard patterns in file names (Hadoop-style globs rather than full regular expressions). So you can provide something like:
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA*' Using PigStorage(',');
and it will load all files whose names start with 'HLPCA' in the Household directory.
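As a hedged sketch of the wildcard approach: the path patterns that LOAD accepts follow Hadoop's glob syntax (`*`, `?`, `[0-9]`, `{a,b}`), so the match can be narrowed to the numbered CSV files. The character-class pattern below is an assumption about how the files are named, based on the names listed in the question; it would also match names such as HLPCA-36000.csv if they existed.

```pig
-- Load every file whose name starts with HLPCA- and ends in .csv.
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA-*.csv' USING PigStorage(',');

-- Narrower alternative: two digits followed by 000, e.g. HLPCA-07000.csv.
B = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA-[0-3][0-9]000.csv' USING PigStorage(',');

-- An explicit brace list is also possible; keep it free of empty entries (no ",,").
C = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000}.csv' USING PigStorage(',');
```

Only one of the three statements is needed; the brace list is shortened here for readability.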

Related

Unable to Store a Pig Relation using Parquet Storer

I am trying the Pig statements below in the grunt shell.
The Pig version is Apache Pig version 0.12.1.
grunt> register /home/user/surender/mapreducejars/parquet-pig-1.0.1.jar;
grunt> A = LOAD '/user/user/inputfiles/parquet.txt' USING PigStorage(',') AS (id:int,name:chararray);
grunt> STORE A into '/user/user/outputfiles/pig' USING parquet.pig.ParquetStorer;
2016-09-27 07:09:18,509 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. parquet/io/ParquetEncodingException
Details at logfile: /home/user/surender/localinputfiles/pig_1474973730264.log
I want to know what went wrong here. Can someone help me store the Pig relation using ParquetStorer?
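The error names a class (parquet/io/ParquetEncodingException) that could not be found, which suggests parquet-pig-1.0.1.jar alone does not provide everything ParquetStorer needs. You need to add a Parquet bundle jar such as parquet-pig-bundle-1.5.0.jar and register it with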
REGISTER '/path_for_jar/parquet-pig-bundle-1.5.0.jar';
Please check the link, which explains this.
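Put together, the session from the question might look like the sketch below. It assumes the 1.5.0 bundle jar, where the storer class is still named parquet.pig.ParquetStorer; /path_for_jar/ is a placeholder, not a real location.

```pig
-- Placeholder jar location; replace with the real path to the bundle jar.
REGISTER '/path_for_jar/parquet-pig-bundle-1.5.0.jar';

A = LOAD '/user/user/inputfiles/parquet.txt' USING PigStorage(',') AS (id:int, name:chararray);

-- ParquetStorer takes its schema from the relation being stored.
STORE A INTO '/user/user/outputfiles/pig' USING parquet.pig.ParquetStorer();
```

The output directory must not already exist when the STORE runs.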

Pig Script useHCatalog flag?

I have written a simple Pig script to read data from a Hive table.
A = LOAD 'default.movie' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP A;
It works when I run it through the Hue Pig user interface, but that uses the useHCatalog flag.
It also works when I run it from the command line with the same flag:
pig -useHCatalog sample.pig
But how can I run it without this flag, by providing the required jar files and configuration in the Pig script? I tried this, but it doesn't work:
REGISTER /usr/lib/hive/lib/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/storage-handlers/hbase/lib/*.jar
It throws an error when I run it without the flag:
2015-12-15 05:05:55,379 [main] ERROR org.apache.pig.PigServer -
exception during parsing: Error during parsing. Table not found :
default.movie table not found Failed to parse: Can not retrieve schema
from loader org.apache.hive.hcatalog.pig.HCatLoader#25bdba7a
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1678)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1411)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:344)
at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
at org.apache.pig.PigServer.executeBatch(PigServer.java:355)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
I just want to know what is behind the useHCatalog flag. What do I have to register for this to work?
You have to pass the Hive configuration as well, namely the file hive-site.xml, which points Pig to the metastore. Otherwise Pig does not know where to look for the table information.
This page might be helpful: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore

MultiStorage in pig

I have run the below pig script in the grunt shell
Register D:\Pig\contrib\piggybank\java\piggybank.jar;
a = load '/part' using PigStorage(',') as (uuid:chararray,timestamp:chararray,Name:chararray,EmailID:chararray,CompanyName:chararray,Location:chararray);
store a into '/output/multistorage' USING MultiStorage('/output/multistorage','2', 'none', ',');
While running, it throws the error shown below:
2015-11-03 05:47:36,328 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve MultiStorage using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Can anyone help me with this?
As the log says, Pig could not resolve your function: MultiStorage is not in any of the packages it searches by default. If the jar is actually accessible, you can try the following code (one line was missing, the DEFINE):
REGISTER D:\Pig\contrib\piggybank\java\piggybank.jar;
DEFINE MULTISTORAGE org.apache.pig.piggybank.storage.MultiStorage();
a = LOAD '/part' USING PigStorage(',') AS (uuid:chararray,timestamp:chararray,Name:chararray,EmailID:chararray,CompanyName:chararray,Location:chararray);
STORE a into '/output/multistorage' USING MULTISTORAGE('/output/multistorage','2', 'none', ',');
You are then partitioning by Name (field index 2, counting from 0).
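An alternative sketch, assuming the same piggybank.jar: skip the DEFINE and give the fully qualified class name directly in the STORE statement. The four arguments are, in order, the parent output path, the zero-based index of the field to split on, the compression setting ('none' here), and the field delimiter.

```pig
REGISTER D:\Pig\contrib\piggybank\java\piggybank.jar;

a = LOAD '/part' USING PigStorage(',') AS (uuid:chararray, timestamp:chararray, Name:chararray, EmailID:chararray, CompanyName:chararray, Location:chararray);

-- Index '2' is the third field (Name): output is split by distinct Name value.
STORE a INTO '/output/multistorage' USING org.apache.pig.piggybank.storage.MultiStorage('/output/multistorage', '2', 'none', ',');
```

Using the full class name avoids depending on Pig's default import list, which is what ERROR 1070 complains about.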

Cannot get schema from loadFunc org.apache.pig.builtin.AvroStorage

I am getting the following error while running this Pig script:
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/avro.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/json-simple-1.1.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/jackson-core-asl-1.8.8.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/jackson-mapper-asl-1.8.8.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar
list_cookies = LOAD '/user/xyz/testbed/llama-2014-Oct-12d/abc'
USING org.apache.pig.piggybank.storage.avro.AvroStorage();
I got the following error:
2014-10-22 11:51:14,705 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.pig.builtin.AvroStorage
Details at logfile: /home/xyz/pig_1413991623605.log
In my case, the cause was simply that the input folder did not exist. The Pig error message was off the mark and not at all helpful. After changing the input folder to one that existed, the error went away. So be sure to check that before spending a lot of time on more difficult debugging!
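One way to rule this out early, sketched here with the path from the question, is to list the input location with the grunt shell's fs command before the LOAD. If the path is missing, the listing says so directly instead of surfacing later as a schema error.

```pig
-- List the input path first; a missing path is reported here.
fs -ls /user/xyz/testbed/llama-2014-Oct-12d/abc

list_cookies = LOAD '/user/xyz/testbed/llama-2014-Oct-12d/abc'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```

The REGISTER lines from the question still have to come before the LOAD; they are omitted here for brevity.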

hadoop pig mapreduce distributed cache files

I am using Hadoop 1.2.1 and Pig 0.10. I have some jar files used in a MapReduce job, so I copied them to HDFS under /tmp/lib. Then in the Pig script I tried adding statements like SET mapred.cache.files /tmp/lib/file.jar; SET mapred.create.symlink yes;. But I got this error:
Exception in thread "main" org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. <line 1, column 0> Syntax error, unexpected symbol at or near 'SET'
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1597)
Try this:
SET mapred.cache.files '/tmp/lib/file.jar'
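Applied to both settings from the question, the quoted form might look like the sketch below. It has not been verified against Pig 0.10. The #file.jar suffix is an assumption added here: in Hadoop 1.x a URI fragment on a cache file names the symlink created when mapred.create.symlink is enabled.

```pig
-- Jar already copied to HDFS; the fragment after '#' is the assumed symlink name.
SET mapred.cache.files '/tmp/lib/file.jar#file.jar';
SET mapred.create.symlink 'yes';
```

Quoting the values keeps characters such as '/' and '#' from being read as part of the script syntax.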

Resources