Pig - System cannot find the file specified - windows

I get an error in Pig Latin when I run the basic command "Dump Students" for the file
>Students = LOAD 'C:\\Users\\avtar\\OneDrive\\Desktop\\student.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);

Related

Pig Script issues

I am using Pig with HCatalog to load data from a Hive external table.
I enter grunt using pig -useHCatalog and execute the following:
register 'datafu'
define Enumerate datafu.pig.bags.Enumerate('1');
imported_data = load 'hive external table' using org.apache.hive.hcatalog.pig.HCatLoader() ;
converted_data = foreach imported_data generate name,ip,domain,ToUnixTime(ToDate(dateandtime,'MM/dd/yyyy hh:mm:ss.SSS aa'))as unix_DateTime,date;
grouped = group converted_data by (name,ip,domain);
result = FOREACH grouped {
    sorted = ORDER converted_data BY unix_DateTime;
    sorted2 = Enumerate(sorted);
    GENERATE FLATTEN(sorted2);
};
All of these commands run and produce the desired result.
Problem:
I saved the above commands as a Pig script named pigFinal.pig and ran it in local mode, because the script is on the local filesystem:
pig -useHCatalog -x local '/path/to/pigFinal.pig';
Exception:
Failed to generate logical plan. Nested exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve datafu.pig.bags.Enumerate using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1507)
    at org.apache.pig.parser.LogicalPlanGenerator.func_eval(LogicalPlanGenerator.java:9372)
    at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:11051)
    at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10810)
    at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10159)
    at org.apache.pig.parser.LogicalPlanGenerator.nested_command(LogicalPlanGenerator.java:16315)
    at org.apache.pig.parser.LogicalPlanGenerator.nested_blk(LogicalPlanGenerator.java:16116)
    at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:16024)
    at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15849)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
    ... 17 more
Where do I need to register the datafu jar for Pig scripts? I guess this is the issue.
Please help.
Make sure the jar file is located in the same folder as your Pig script, or provide the correct path to the jar when registering it in the script. So in your case, change this:
register 'datafu'
to one of the following:
-- If your jar file is, say, datafu-1.2.0.jar and it is located in the same folder as your Pig script, put this at the top of the script:
REGISTER datafu-1.2.0.jar;
-- Otherwise, if datafu-1.2.0.jar is located in the folder /usr/hadoop/lib, put this at the top of the script:
REGISTER /usr/hadoop/lib/datafu-1.2.0.jar;
Alternatively, pass the jars on the command line:
pig -useHCatalog \
-x local \
-Dpig.additional.jars="/local/path/to/datafu.jar:/local/path/other.jar" \
/path/to/pigFinal.pig;
Or use the fully qualified path in your Pig script:
register /local/path/to/datafu.jar;

Creating schema using pig script

I need some guidance with a simple task: creating a schema in Apache Pig for my data file. Two files are involved. The first is a data file that contains the data with no column header, and the second contains the column headers for the data file. So, basically, the column_header file is the schema for the data file. How do I express this in a Pig script? Here's what I have so far.
column_header = load 'sitecatalyst/coulmn_headers.tsv' using PigStorage('\t');
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as column_header;
schema = foreach data generate column_header;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
This is the output of
DUMP column_header
(accept_language,browser,browser_height,browser_width)
When I run
DUMP data;
only the first column of the data is output, which is wrong.
en-US
en-US
en-US
en-US
Instead it should be,
en-US 638 755 1600
en-US 638 655 1342
en-US 638 723 1612
en-US 638 231 1234
How can I trick Pig into using "column_header" as a string that can be used in the AS clause of the PigStorage load on the second line of code?
Edit:
This code works, but instead of hard-coding my column headers I would like the Pig script to read them from the file.
column_header = load 'sitecatalyst/coulmn_headers.tsv' using PigStorage('\t');
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as (accept_language,browser,browser_height,browser_width);
schema = foreach data generate accept_language,browser,browser_height,browser_width;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
You cannot achieve this kind of parameterization from within the Pig script directly.
You can get the same result with parameter substitution:
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as $column_header;
schema = foreach data generate *;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
and run the Pig script with:
pig -param_file (location of the file) column
The parameter file should have the format:
column_header = complete schema
https://blogs.msdn.microsoft.com/bigdatasupport/2014/08/12/how-to-use-parameter-substitution-with-pig-latin-and-powershell/
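The parameter file does not have to be written by hand. Below is a minimal Python sketch, assuming the header file holds a single tab-separated line (as the DUMP column_header output in the question suggests). The function names and the output file name hit_data.params are hypothetical. The value is wrapped in double quotes on the assumption that Pig strips the quotes and substitutes the rest verbatim, so check the result against your Pig version.

```python
def build_param_line(header_text, param_name="column_header"):
    """Turn one tab-separated header line into a Pig param-file entry."""
    first_line = header_text.splitlines()[0]
    fields = [name.strip() for name in first_line.split("\t") if name.strip()]
    schema = "(" + ",".join(fields) + ")"
    # Quote the value so the parentheses and commas travel as one string.
    return '{}="{}"'.format(param_name, schema)


def write_param_file(header_path, param_path):
    """Read the header file and write the param file (both paths are up to you)."""
    with open(header_path, encoding="utf-8") as src:
        line = build_param_line(src.read())
    with open(param_path, "w", encoding="utf-8") as dst:
        dst.write(line + "\n")


example = build_param_line("accept_language\tbrowser\tbrowser_height\tbrowser_width\n")
print(example)
```

Calling write_param_file('sitecatalyst/coulmn_headers.tsv', 'hit_data.params') would then produce a file that can be passed to pig with -param_file.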

Bulk load in hbase using pig

I have a log file in HDFS that needs to be parsed and put into an HBase table.
I want to do this using Pig.
How can I go about it? The Pig script should parse the logs and then put them into HBase.
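The Pig script would be as follows (assuming tab is the field separator in the log file). Note that HBaseStorage uses the first field of each tuple as the row key, so only the remaining fields are listed as columns; the column families P and S must already exist in table1.
A = load '/home/log.txt' using PigStorage('\t') as (one:chararray,two:chararray,three:chararray,four:chararray);
STORE A INTO 'hbase://table1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('P:two,S:three,S:four');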

Not able to filter data using Apache Pig

I am using Hadoop 1.0.3, Pig 0.11.0 on Ubuntu 12.04. In the part-m-00000 file in HDFS the content is as below
training#BigDataVM:~/Installations/hadoop-1.0.3$ bin/hadoop fs -cat /user/training/user/part-m-00000
1,Praveen,20,India,M
2,Prajval,5,India,M
3,Prathibha,15,India,F
I am loading it into a bag and then filtering it as below.
Users1 = load '/user/training/user/part-m-00000' as (user_id, name, age:int, country, gender);
Fltrd = filter Users1 by age <= 16;
When I dump Users1, 5 records are shown in the console, but dumping Fltrd fetches no records.
dump Fltrd;
The following warning is shown in the Pig console:
2013-02-24 16:19:40,735 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 12 time(s).
It looks like I have made some simple mistake, but I can't figure out what it is. Please help me with this.
Since you haven't specified a load function, Pig uses PigStorage, whose default delimiter is '\t'.
If part-m-00000 is a text file, try setting the delimiter to ',':
Users1 = load '/user/training/user/part-m-00000' using PigStorage(',')
as (user_id, name, age:int, country, gender);
If it's a SequenceFile then have a look at Dolan's or my answer on this question.
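To see why the filter returns nothing with the default delimiter, here is a small Python sketch. It is not Pig itself: split_record is a hypothetical helper that only imitates a delimiter-based loader, and padding the missing fields with None is a simplification of the behaviour behind the ACCESSING_NON_EXISTENT_FIELD warning.

```python
def split_record(line, delimiter, width):
    """Split a line on the delimiter and pad missing fields with None."""
    fields = line.split(delimiter)
    return fields + [None] * (width - len(fields))


line = "1,Praveen,20,India,M"

# Default tab delimiter: the whole line lands in the first field.
by_tab = split_record(line, "\t", 5)
# Comma delimiter: each value lands in its own field.
by_comma = split_record(line, ",", 5)

print(by_tab)
print(by_comma)
```

With the tab delimiter the age position is empty, so a comparison such as age <= 16 has nothing to match and every record is dropped. With the comma delimiter the age field holds the expected value.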

Pig is not loading csv

I'm trying to load a pipe delimited file ('|') in pig using the following command:
A = load 'test.csv' using PigStorage('|');
But I keep getting this error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. java.net.URISyntaxException cannot be cast to java.lang.Error
I've looked all over, but I can't find any reason this would happen. The test file I have above is a simple file that just contains 1|2|3 for testing.
If you are running Pig with MAPREDUCE as the ExecType mode, the following commands should work:
A = LOAD '/user/pig/input/pipetest.csv' USING PigStorage('|');
DUMP A;
This is the output shown on screen:
(1,2,3)
Note that I have used the full HDFS path to my csv file in the LOAD command.
