Pig is not loading csv - hadoop

I'm trying to load a pipe delimited file ('|') in pig using the following command:
A = load 'test.csv' using PigStorage('|');
But I keep getting this error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. java.net.URISyntaxException cannot be cast to java.lang.Error
I've looked all over, but I can't find any reason this would happen. The test file is a simple one that just contains 1|2|3 for testing.

If you are running Pig with MAPREDUCE as the ExecType mode (the default), then the following commands should work:
A = LOAD '/user/pig/input/pipetest.csv' USING PigStorage('|');
DUMP A;
Here is the output on your screen:
(1,2,3)
Note that I have included the full HDFS path to my csv file in the LOAD command.
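If test.csv is still on the local filesystem, the load will fail in MapReduce mode because Pig resolves the path against HDFS. A minimal sketch of copying it up first (the HDFS target directory is an assumption):
hadoop fs -mkdir -p /user/pig/input
hadoop fs -put test.csv /user/pig/input/pipetest.csv
Alternatively, start Pig in local mode (pig -x local) and keep the relative path.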

Related

Pig Script issues

I am using Pig with HCatalog to load data from a Hive external table.
I enter grunt using pig -useHCatalog and execute the following:
register 'datafu'
define Enumerate datafu.pig.bags.Enumerate('1');
imported_data = load 'hive external table' using org.apache.hive.hcatalog.pig.HCatLoader() ;
converted_data = foreach imported_data generate name,ip,domain,ToUnixTime(ToDate(dateandtime,'MM/dd/yyyy hh:mm:ss.SSS aa'))as unix_DateTime,date;
grouped = group converted_data by (name,ip,domain);
result = FOREACH grouped {
sorted = ORDER converted_data BY unix_DateTime;
sorted2 = Enumerate(sorted);
GENERATE FLATTEN(sorted2);
};
All commands run and produce the desired result.
Problem:
I put the above commands into a Pig script named pigFinal.pig and executed the following in local mode, because the script is on the local filesystem:
pig -useHCatalog -x local '/path/to/pigFinal.pig';
Exception
Failed to generate logical plan. Nested exception:
org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve datafu.pig.bags.Enumerate using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1507)
    at org.apache.pig.parser.LogicalPlanGenerator.func_eval(LogicalPlanGenerator.java:9372)
    at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:11051)
    at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10810)
    at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10159)
    at org.apache.pig.parser.LogicalPlanGenerator.nested_command(LogicalPlanGenerator.java:16315)
    at org.apache.pig.parser.LogicalPlanGenerator.nested_blk(LogicalPlanGenerator.java:16116)
    at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:16024)
    at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15849)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
    ... 17 more
Where do I need to register the datafu jar for Pig scripts? I guess this is the issue.
Please help.
You have to ensure that the jar file is located in the same folder as your Pig script, or that the correct path is provided in the script when registering the jar. So in your case,
modify this
register 'datafu'
to
-- If, say, datafu-1.2.0.jar is your jar file and it is located in the same folder as your Pig script, then put this at the top of your script:
REGISTER datafu-1.2.0.jar;
-- Else, if datafu-1.2.0.jar is located in the folder /usr/hadoop/lib, then put this at the top of your script:
REGISTER /usr/hadoop/lib/datafu-1.2.0.jar;
Alternatively, pass the jars to Pig on the command line with pig.additional.jars:
pig -useHCatalog \
-x local \
-Dpig.additional.jars="/local/path/to/datafu.jar:/local/path/other.jar" \
/path/to/pigFinal.pig;
OR use the fully qualified path in your Pig script:
register /local/path/to/datafu.jar;
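For reference, a minimal sketch of how the top of pigFinal.pig could look once the jar is registered with a fully qualified path (the location /usr/hadoop/lib/datafu-1.2.0.jar is an assumption; adjust it to wherever the DataFu jar actually lives):
REGISTER /usr/hadoop/lib/datafu-1.2.0.jar;  -- assumed jar location
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
imported_data = LOAD 'hive external table' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- the remaining statements stay the same as in the grunt session above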

Apache Pig: How to load a sequence file which is stored in hdfs?

My sequence files are stored directly in hdfs e.g.:
grunt> ls
grunt> ls /blabla
hdfs://namenode1:54310/blabla/0411f03a-db7f-48d0-9542-5203304e3e81.seq<r 3> 185284523
hdfs://namenode1:54310/blabla/05be8fc0-e967-42e1-b76a-0d7108a69d17.seq<r 3> 201489688
hdfs://namenode1:54310/blabla/06222427-519c-49c0-bbbf-49a9f43bbd13.seq<r 3> 196858576
hdfs://namenode1:54310/blabla/066da26a-48da-45b1-83f5-60d16475e40d.seq<r 3> 194832641
hdfs://namenode1:54310/blabla/07cbfc83-42a2-47bf-b364-d39da3a2d071.seq<r 3> 194806047
hdfs://namenode1:54310/blabla/10dea7b8-9ed3-4e66-b4bd-a3c07d8bf39e.seq<r 3> 166224702
How can I create a Pig script which is reading every file from the directory "blabla" and performing an action?
I've tried multiple ways of loading the input, but none of them worked:
%default INPUT '/blabla/f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'
%default INPUT 'hdfs://namenode1:54310/blabla/f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'
%default INPUT 'f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'
I always get the error:
Input(s):
Failed to read data from "hdfs://namenode1:54310/........."
You can try reading the sequence files in one of these ways:
Pig SequenceFileLoader:
A = LOAD 'hdfs://namenode1:54310/blabla/*' using org.apache.pig.piggybank.storage.SequenceFileLoader();
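Note that SequenceFileLoader ships in the piggybank jar, so it normally has to be registered first. A minimal sketch (the jar path is an assumption, and the actual key/value types depend on the Writables stored in your .seq files):
REGISTER /path/to/piggybank.jar;  -- adjust to the actual piggybank location
A = LOAD 'hdfs://namenode1:54310/blabla/*' USING org.apache.pig.piggybank.storage.SequenceFileLoader() AS (key, value);
DUMP A;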
(Or) Using Elephant Bird:
REGISTER 'elephant-bird-pig-3.0.5.jar';
REGISTER 'elephant-bird-core-4.1.jar';
REGISTER 'elephant-bird-hadoop-compat-4.1.jar';
A = LOAD 'hdfs://namenode1:54310/blabla/*' using com.twitter.elephantbird.pig.load.SequenceFileLoader();
Did you try this way:
%default INPUT 'hdfs://namenode1:54310/blabla/*'
?
It should work if your .seq files are readable. It looks like they are not, because your attempts above should have loaded at least one file. Could you give the complete log line?
Maybe you need to use Pig's SequenceFileLoader.

How to do a bulkload to Hbase from CSV from command line

I am trying to bulk load a csv file from the command line.
This is what I am trying:
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://localhost:9000/transactionsFile.csv bulkLoadtable
The error I am getting is below:
15/09/01 13:49:44 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:9000/transactionsFile.csv
15/09/01 13:49:44 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory hdfs://localhost:9000/transactionsFile.csv. Does it contain files in subdirectories that correspond to column family names?
Is it possible to do bulkload from command line without using java mapreduce.
You are almost correct; the only thing you missed is that the input to LoadIncrementalHFiles must be a directory. I suggest keeping the csv file under a directory and passing the path up to the directory name as an argument to the command. Please refer to the link below.
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.html#doBulkLoad(org.apache.hadoop.fs.Path,%20org.apache.hadoop.hbase.client.Admin,%20org.apache.hadoop.hbase.client.Table,%20org.apache.hadoop.hbase.client.RegionLocator)
Hope this helps.
You can do a bulk load from the command line. There are multiple ways to do this:
1. Using completebulkload:
a. Prepare your data by creating data files (StoreFiles) from a MapReduce job using HFileOutputFormat.
b. Import the prepared data using the completebulkload tool, e.g.:
hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
More details: hbase bulk load
2. Using importtsv, e.g.:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,id,temp:in,temp:out,vibration,pressure:in,pressure:out" sensor hdfs://sandbox.hortonworks.com:/tmp/hbase.csv
more details
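A minimal sketch of a two-step variant for the original transactionsFile.csv, where importtsv first writes HFiles and completebulkload then moves them into the table (the column family/qualifier names and the HFile output directory are assumptions):
# Step 1: parse the CSV and generate HFiles instead of issuing Puts
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.separator=, \
-Dimporttsv.columns=HBASE_ROW_KEY,cf:amount,cf:date \
-Dimporttsv.bulk.output=hdfs://localhost:9000/tmp/bulkload-hfiles \
bulkLoadtable hdfs://localhost:9000/transactionsFile.csv
# Step 2: hand the generated HFiles over to the table
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://localhost:9000/tmp/bulkload-hfiles bulkLoadtable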

"Doesn't exist in RM" backend error in Pig

I'm getting an error in the Cloudera QuickStart VM I downloaded from http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html.
I am trying a toy example from Tom White's Hadoop: The Definitive Guide book called max_temp.pig, which "finds the maximum temperature by year".
I created a file called temps.txt that contains (year, temperature, quality) entries on each line:
1950 0 1
1950 22 1
1950 -11 1
1949 111 1
Using the example code in the book, I typed the following Pig code into the Grunt terminal:
records = LOAD '/home/cloudera/Desktop/temps.txt'
AS (year:chararray, temperature:int, quality:int);
DUMP records;
After I typed DUMP records;, I got the error:
2014-05-22 11:33:34,286 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias records. Backend error : org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1400775973236_0006' doesn't exist in RM.
…
Details at logfile: /home/cloudera/Desktop/pig_1400782722689.log
I attempted to find out what was causing the error through a Google search: https://www.google.com/search?q=%22application+with+id%22+%22doesn%27t+exist+in+RM%22.
The results there weren't helpful. For example, http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-troubleshoot-error-vpc.html mentioned this bug and said "To solve this problem, you must configure a VPC that includes a DHCP Options Set whose parameters are set to the following values..."
Amazon's suggested fix doesn't seem to be the problem because I'm not using AWS.
EDIT:
I think the HDFS file path is correct.
[cloudera@localhost Desktop]$ ls
Eclipse.desktop gnome-terminal.desktop max_temp.pig temps.txt
[cloudera@localhost Desktop]$ pwd
/home/cloudera/Desktop
There's another exception before your error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://localhost.localdomain:8020/home/cloudera/Desktop/temps.txt
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
Is your file in HDFS? Have you checked the file path?
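In MapReduce mode, Pig resolves LOAD paths against HDFS, so /home/cloudera/Desktop/temps.txt points into HDFS rather than the local disk. A minimal sketch of checking and fixing that (the HDFS target directory /user/cloudera is an assumption):
hadoop fs -ls /home/cloudera/Desktop/temps.txt   # fails: the file only exists on the local filesystem
hadoop fs -mkdir -p /user/cloudera
hadoop fs -put /home/cloudera/Desktop/temps.txt /user/cloudera/temps.txt
After that, the script would load '/user/cloudera/temps.txt' instead.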
I was able to solve this problem by using pig -x local to start the Grunt interpreter instead of just pig.
I should have used local mode because I did not have access to a Hadoop cluster.
Running plain pig (MapReduce mode) gave me these errors:
2014-05-22 11:33:34,286 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias records. Backend error : org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1400775973236_0006' doesn't exist in RM.
2014-05-22 11:33:28,799 [JobControl] WARN org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://localhost.localdomain:8020/home/cloudera/Desktop/temps.txt
From http://pig.apache.org/docs/r0.9.1/start.html:
Pig has two execution modes or exectypes:
Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java" command (java -cp pig.jar ...).
Running the toy example from Tom White's Hadoop: The Definitive Guide book:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'temps.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
against the following data set in temps.txt (remember that Pig's default input is tab-delimited files):
1950 0 1
1950 22 1
1950 -11 1
1949 111 1
gives this:
[cloudera@localhost Desktop]$ pig -x local -f max_temp.pig 2>log
(1949,111)
(1950,22)

Not able to filter data using Apache Pig

I am using Hadoop 1.0.3, Pig 0.11.0 on Ubuntu 12.04. In the part-m-00000 file in HDFS the content is as below
training@BigDataVM:~/Installations/hadoop-1.0.3$ bin/hadoop fs -cat /user/training/user/part-m-00000
1,Praveen,20,India,M
2,Prajval,5,India,M
3,Prathibha,15,India,F
I am loading it into a bag and then filtering it as below.
Users1 = load '/user/training/user/part-m-00000' as (user_id, name, age:int, country, gender);
Fltrd = filter Users1 by age <= 16;
But when I dump Users1, 5 records are shown in the console, while dumping Fltrd fetches no records.
dump Fltrd;
The below warning is shown in the Pig console
2013-02-24 16:19:40,735 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 12 time(s).
Looks like I have made some simple mistake, but I couldn't figure out what it is. Please help me with this.
Since you haven't specified any load function, Pig will use PigStorage, whose default delimiter is '\t'.
If part-m-00000 is a text file, then try setting the delimiter to ',':
Users1 = load '/user/training/user/part-m-00000' using PigStorage(',')
as (user_id, name, age:int, country, gender);
If it's a SequenceFile, then have a look at Dolan's or my answer on this question.
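A minimal sketch of the full pipeline with the comma delimiter (the extra type annotations are assumptions; the expected output is based on the three sample rows shown above):
Users1 = LOAD '/user/training/user/part-m-00000' USING PigStorage(',')
         AS (user_id:int, name:chararray, age:int, country:chararray, gender:chararray);
Fltrd = FILTER Users1 BY age <= 16;
DUMP Fltrd;
-- expected: (2,Prajval,5,India,M) and (3,Prathibha,15,India,F)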
