Pig script issues - Hadoop

I am using Pig with HCatalog to load data from a Hive external table.
I enter the Grunt shell using pig -useHCatalog and execute the following:
register 'datafu'
define Enumerate datafu.pig.bags.Enumerate('1');
imported_data = load 'hive external table' using org.apache.hive.hcatalog.pig.HCatLoader() ;
converted_data = foreach imported_data generate name,ip,domain,ToUnixTime(ToDate(dateandtime,'MM/dd/yyyy hh:mm:ss.SSS aa'))as unix_DateTime,date;
grouped = group converted_data by (name,ip,domain);
result = FOREACH grouped {
sorted = ORDER converted_data BY unix_DateTime;
sorted2 = Enumerate(sorted);
GENERATE FLATTEN(sorted2);
};
All commands run and produce the desired result.
Problem:
I put the above commands into a Pig script named pigFinal.pig and executed the following in local mode, because the script is on the local filesystem:
pig -useHCatalog -x local '/path/to/pigFinal.pig';
Exception
Failed to generate logical plan. Nested exception:
org.apache.pig.backend.executionengine.ExecException: ERROR 1070:
Could not resolve datafu.pig.bags.Enumerate using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1507)
at org.apache.pig.parser.LogicalPlanGenerator.func_eval(LogicalPlanGenerator.java:9372)
at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:11051)
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10810)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10159)
at org.apache.pig.parser.LogicalPlanGenerator.nested_command(LogicalPlanGenerator.java:16315)
at org.apache.pig.parser.LogicalPlanGenerator.nested_blk(LogicalPlanGenerator.java:16116)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:16024)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15849)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 17 more
Where do I need to register the datafu jar for Pig scripts? I guess this is the issue.
Please help.

You have to ensure the jar file is located in the same folder as your Pig script, or ensure that the correct path to the jar is given in the script when registering it. So in your case:
Modify this
register 'datafu'
To
-- If, let's say, datafu-1.2.0.jar is your jar file and it is located in the same folder as your Pig script, then at the top of your script have this
REGISTER datafu-1.2.0.jar
-- Else, let's say datafu-1.2.0.jar is your jar file and it is located in the folder /usr/hadoop/lib, then at the top of your script have this
REGISTER /usr/hadoop/lib/datafu-1.2.0.jar

pig -useHCatalog \
-x local \
-Dpig.additional.jars="/local/path/to/datafu.jar:/local/path/to/other.jar" \
/path/to/pigFinal.pig;
OR
in your Pig script, use the fully qualified path:
register /local/path/to/datafu.jar;
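For completeness, the top of pigFinal.pig would then look something like this (a sketch; the jar path is a placeholder for wherever your datafu jar actually lives, and the remaining statements stay exactly as in the question):
REGISTER /local/path/to/datafu.jar;
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
imported_data = load 'hive external table' using org.apache.hive.hcatalog.pig.HCatLoader();
-- ... rest of the script unchanged ...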

Related

How to load csv file from hdfs to hbase table using Dimporttsv

I am trying to load a csv file into an HBase table using the shell command Dimporttsv.
The csv files reside in a directory in my HDFS (/csvFiles).
The csv file was generated from a MySQL table with the following fields:
+-------------+
| Field       |
+-------------+
| tweet_id    |
| user_id     |
| screen_name |
| description |
| created_at  |
+-------------+
I created a table in HBase with a single column family, as shown below:
create 'dummyTable', 'cf1'
The command I am using:
ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,cf1:user_id,cf1:tweet_id,cf1:screen_name,cf1:description,cf1:created_at dummyTable /csvFiles/all_users.csv
However, I am getting this syntax error:
SyntaxError: (hbase):8: syntax error, unexpected tSYMBEG
I've looked at the following posts and followed the recommendations in them but to no avail. I would appreciate your help.
Import TSV file into hbase table
https://community.hortonworks.com/articles/4942/import-csv-data-into-hbase-using-importtsv.html
http://hbase.apache.org/book.html#importtsv
Exit from the HBase shell and try running it from bash, adding single quotes around importtsv.columns:
bash$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns='HBASE_ROW_KEY,cf1:user_id,cf1:tweet_id,cf1:screen_name,cf1:description,cf1:created_at' dummyTable hdfs://<your_name_node_addr>/csvFiles/all_users.csv
(or)
From Hbase Shell:
hbase(main):001:0> ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns='HBASE_ROW_KEY,cf1:user_id,cf1:tweet_id,cf1:screen_name,cf1:description,cf1:created_at' dummyTable hdfs://<your_name_node_addr>/csvFiles/all_users.csv
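Once the import job finishes, a quick sanity check is to scan a few rows back out of the table, for example from bash by piping a command into the HBase shell (a sketch):
echo "scan 'dummyTable', {LIMIT => 5}" | hbase shell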

How to do a bulkload to Hbase from CSV from command line

I am trying to bulk-load a csv file from the command line.
This is what I am trying:
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://localhost:9000/transactionsFile.csv bulkLoadtable
The error I am getting is below:
15/09/01 13:49:44 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:9000/transactionsFile.csv
15/09/01 13:49:44 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory hdfs://localhost:9000/transactionsFile.csv. Does it contain files in subdirectories that correspond to column family names?
Is it possible to do a bulk load from the command line without writing a Java MapReduce job?
You are almost correct; the only thing missing is that the input to bulkLoadtable must be a directory. I suggest keeping the csv file under a directory and passing the path up to the directory name as an argument to the command. Please refer to the link below.
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.html#doBulkLoad(org.apache.hadoop.fs.Path,%20org.apache.hadoop.hbase.client.Admin,%20org.apache.hadoop.hbase.client.Table,%20org.apache.hadoop.hbase.client.RegionLocator)
Hope this helps.
You can do a bulk load from the command line. There are multiple ways to do this:
1. Using HFileOutputFormat and completebulkload:
a. Prepare your data by creating data files (StoreFiles) from a MapReduce job using HFileOutputFormat.
b. Import the prepared data using the completebulkload tool, e.g.:
hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
More details: hbase bulk load
2. Using importtsv, e.g.:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns="HBASE_ROW_KEY,id,temp:in,temp:out,vibration,pressure:in,pressure:out" sensor hdfs://sandbox.hortonworks.com:/tmp/hbase.csv
more details
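A common way to combine the two ideas above without writing your own MapReduce job is to let ImportTsv write HFiles (via -Dimporttsv.bulk.output) and then hand that output directory to LoadIncrementalHFiles. A rough sketch against the file from the question; the output directory, the cf1 column family, and the column names are illustrative, and the target table must already exist with that family:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf1:col1,cf1:col2 \
  -Dimporttsv.bulk.output=hdfs://localhost:9000/tmp/hfiles \
  bulkLoadtable hdfs://localhost:9000/transactionsFile.csv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://localhost:9000/tmp/hfiles bulkLoadtable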

"Doesn't exist in RM" backend error in Pig

I'm getting an error in the Cloudera QuickStart VM I downloaded from http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html.
I am trying a toy example from Tom White's Hadoop: The Definitive Guide book called max_temp.pig, which "finds the maximum temperature by year".
I created a file called temps.txt that contains (year, temperature, quality) entries on each line:
1950 0 1
1950 22 1
1950 -11 1
1949 111 1
Using the example code in the book, I typed the following Pig code into the Grunt terminal:
records = LOAD '/home/cloudera/Desktop/temps.txt'
AS (year:chararray, temperature:int, quality:int);
DUMP records;
After I typed DUMP records;, I got the error:
2014-05-22 11:33:34,286 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias records. Backend error : org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1400775973236_0006' doesn't exist in RM.
…
Details at logfile: /home/cloudera/Desktop/pig_1400782722689.log
I attempted to find out what was causing the error through a Google search: https://www.google.com/search?q=%22application+with+id%22+%22doesn%27t+exist+in+RM%22.
The results there weren't helpful. For example, http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-troubleshoot-error-vpc.html mentioned this bug and said "To solve this problem, you must configure a VPC that includes a DHCP Options Set whose parameters are set to the following values..."
Amazon's suggested fix doesn't seem to apply, because I'm not using AWS.
EDIT:
I think the HDFS file path is correct.
[cloudera@localhost Desktop]$ ls
Eclipse.desktop gnome-terminal.desktop max_temp.pig temps.txt
[cloudera@localhost Desktop]$ pwd
/home/cloudera/Desktop
There's another exception before your error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://localhost.localdomain:8020/home/cloudera/Desktop/temps.txt
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
Is your file in HDFS? Have you checked the file path?
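A quick way to check from the command line, using the path Pig is resolving in the error above (a sketch):
hdfs dfs -ls hdfs://localhost.localdomain:8020/home/cloudera/Desktop/temps.txt
# if it is not there, copy the local file into HDFS first
hdfs dfs -mkdir -p /home/cloudera/Desktop
hdfs dfs -put /home/cloudera/Desktop/temps.txt /home/cloudera/Desktop/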
I was able to solve this problem by doing pig -x local to start the Grunt interpreter instead of just pig.
I should have used local mode because I did not have access to a Hadoop cluster.
Running in the default MapReduce mode (just pig) gave me these errors:
2014-05-22 11:33:34,286 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias records. Backend error : org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1400775973236_0006' doesn't exist in RM.
2014-05-22 11:33:28,799 [JobControl] WARN org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://localhost.localdomain:8020/home/cloudera/Desktop/temps.txt
From http://pig.apache.org/docs/r0.9.1/start.html:
Pig has two execution modes or exectypes:
Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java" command (java -cp pig.jar ...).
Running the toy example from Tom White's Hadoop: The Definitive Guide book:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'temps.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
against the following data set in temps.txt (remember that Pig's default input is tab-delimited files):
1950 0 1
1950 22 1
1950 -11 1
1949 111 1
gives this:
[cloudera@localhost Desktop]$ pig -x local -f max_temp.pig 2>log
(1949,111)
(1950,22)

Bulk load in hbase using pig

I have a log file in HDFS which needs to be parsed and put into an HBase table.
I want to do this using Pig.
How can I go about it? The Pig script should parse the logs and then put them into HBase.
The Pig script would be (assuming tab is the data separator in your log file):
A = load '/home/log.txt' using PigStorage('\t') as (one:chararray,two:chararray,three:chararray,four:chararray);
STORE A INTO 'hbase://table1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('P:one,P:two,S:three,S:four');
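One thing worth noting: HBaseStorage does not create the target table for you, so table1 with the column families used above (P and S in this example) has to exist before the STORE runs, e.g. from the HBase shell:
create 'table1', 'P', 'S'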

Pig is not loading csv

I'm trying to load a pipe-delimited file ('|') in Pig using the following command:
A = load 'test.csv' using PigStorage('|');
But I keep getting this error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. java.net.URISyntaxException cannot be cast to java.lang.Error
I've looked all over, but I can't find any reason this would happen. The test file I have above is a simple file that just contains 1|2|3 for testing.
If you are running Pig with MAPREDUCE as the ExecType (the default mode), then the following command should work:
A = LOAD '/user/pig/input/pipetest.csv' USING PigStorage('|');
DUMP A;
Here is the output on your screen
(1,2,3)
Note that I have included the full HDFS path to my csv file in the LOAD command.
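If you just want to test against the local file instead, the same load should also work in local mode (a sketch; start Grunt with pig -x local and use a local filesystem path):
pig -x local
grunt> A = LOAD 'test.csv' USING PigStorage('|');
grunt> DUMP A;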
