PIG command execution - hadoop

I am learning Hadoop by myself, so I am not sure if what I'm asking is even a problem. When I run the command pig -x local to run it locally, I get the following message:
15/10/05 15:23:28 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/05 15:23:28 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2015-10-05 15:23:28,830 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-05 15:23:28,831 [main] INFO org.apache.pig.Main - Logging error messages to: /home/nkhl/pig_1444038808829.log
2015-10-05 15:23:29,050 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/nkhl/.pigbootup not found
2015-10-05 15:23:29,333 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-05 15:23:29,334 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-05 15:23:29,335 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2015-10-05 15:23:29,562 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
It looks different on my online tutor's screen, so I am a little confused.
What concerns me most is the deprecation part. Can someone help me with that, please? What is it trying to say? Don't get me wrong, everything works fine: the Grunt shell loads up, and things execute fine. I just wanted to know what that means.
It's an Ubuntu machine.
Thanks!

Running Pig in local mode is great, AFAIK, for quick testing, like displaying sysout output from a UDF.
You can safely ignore the warnings above. They are saying that some of the variables set in the *-site.xml configuration files (core-site.xml, mapred-site.xml, etc.) are deprecated.
You can switch those messages off by editing the
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation
logger in the log4j.properties file.
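A minimal sketch of what that could look like (where the log4j.properties file lives, and how Pig picks it up, varies by installation, so treat the path as an assumption):
# conf/log4j.properties - raise the threshold for the deprecation logger
# so its INFO-level messages are suppressed
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN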

You have some Hadoop-related variables set, such as HADOOP_HOME or HADOOP_PREFIX or HADOOP_CONF_DIR, which aren't needed if you are running Pig in local mode.
unset HADOOP_HOME
unset HADOOP_PREFIX
unset HADOOP_CONF_DIR
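To check which of these are set in your shell before unsetting them, a quick sketch using standard tools:
$ env | grep -E 'HADOOP_(HOME|PREFIX|CONF_DIR)'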
Deprecations aren't scary. They are a reminder that the code is calling something that will eventually go away in a future version. These specific deprecations are caused by differences between Hadoop 1 and Hadoop 2. Pig is compatible with both versions. If you happened to be using Hadoop 1.2.1 instead of 2.x, you wouldn't see the warnings, because Pig checks the Hadoop 1 values first.
If you're interested in learning more, you can check out the Pig source code.
https://github.com/apache/pig/blob/release-0.15.0/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java#L219-L222
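If you want to see where the old names come from in your own setup, one approach (assuming HADOOP_CONF_DIR points at your configuration directory) is to grep the config files for the deprecated keys:
$ grep -El 'fs\.default\.name|mapred\.job\.tracker' "$HADOOP_CONF_DIR"/*.xml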

Related

Flag -useHCatalog not working

I installed CDH5.4 on a single node following the instructions here; I also put the hive-metastore in local mode using these instructions, and everything works perfectly, except when I try to connect Pig to the metastore:
➜ ~ pig -useHCatalog
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
2015-05-01 15:45:08,657 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0-cdh5.4.0 (rUnversioned directory) compiled Apr 21 2015, 12:19:15
2015-05-01 15:45:08,658 [main] INFO org.apache.pig.Main - Logging error messages to: /home/itam/pig_1430495108571.log
2015-05-01 15:45:09,035 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:09,035 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:09,035 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020
2015-05-01 15:45:09,940 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:09,941 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
2015-05-01 15:45:09,941 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:09,999 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:10,001 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:10,088 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:10,089 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:10,125 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:10,126 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:10,160 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:10,162 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:10,194 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:10,195 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:10,227 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:10,228 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:10,261 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:10,262 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-05-01 15:45:10,295 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-05-01 15:45:10,296 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
and when I tried to access the table:
grunt> a = load 'ufos' using org.apache.hcatalog.pig.HCatLoader();
2015-05-01 15:46:11,656 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/itam/pig_1430495108571.log
grunt>
Hadoop version
➜ ~ hadoop version
Hadoop 2.6.0-cdh5.4.0
Subversion http://github.com/cloudera/hadoop -r c788a14a5de9ecd968d1e2666e8765c5f018c271
Compiled by jenkins on 2015-04-21T19:16Z
Compiled with protoc 2.5.0
From source with checksum cd78f139c66c13ab5cee96e15a629025
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.4.0.jar
UPDATE: I just tried with Impala, and it doesn't see anything either:
➜ ~ impala-shell
/usr/lib/python2.7/dist-packages/pkg_resources.py:1049: UserWarning: /home/itam/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
warnings.warn(msg, UserWarning)
Starting Impala Shell without Kerberos authentication
Connected to 6b512e41337d:21000
Server version: impalad version 2.2.0-cdh5 RELEASE (build 2ffd73a4255cefd521362ffe1cfb37463f67f75c)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v2.2.0-cdh5 (2ffd73a) built on Tue Apr 21 12:09:21 PDT 2015)
[6b512e41337d:21000] > invalidate metadata;
Query: invalidate metadata
[6b512e41337d:21000] > show tables;
Query: show tables
Fetched 0 row(s) in 0.00s
but from beeline:
~ beeline -u jdbc:hive2://
scan complete in 2ms
Connecting to jdbc:hive2://
Connected to: Apache Hive (version 1.1.0-cdh5.4.0)
Driver: Hive JDBC (version 1.1.0-cdh5.4.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.4.0 by Apache Hive
0: jdbc:hive2://> show tables;
OK
+-----------+--+
| tab_name |
+-----------+--+
| ufos |
+-----------+--+
1 row selected (0.701 seconds)
It worked... What is happening?
UPDATE: I am running HCatalog too:
➜ ~ sudo service hive-webhcat-server status
* WEBHCat server is running
➜ ~ hcat -e "desc ufos"
OK
timestamp string from deserializer
city string from deserializer
state string from deserializer
shape string from deserializer
duration string from deserializer
summary string from deserializer
posted string from deserializer
Time taken: 1.314 seconds
UPDATE: The problem with Impala was that I hadn't copied hive-site.xml to /etc/impala/conf; once this was done, impala-shell worked properly.
The loader you are using is deprecated. Instead of using org.apache.hcatalog.pig.HCatLoader, you need to use org.apache.hive.hcatalog.pig.HCatLoader.
From org.apache.hcatalog.pig.HCatLoader:
Deprecated.
Use/modify HCatLoader instead
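Applied to the load statement from the question, that becomes:
grunt> a = load 'ufos' using org.apache.hive.hcatalog.pig.HCatLoader();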
I was facing this issue on HDP 2.3 with Pig 0.15.
The package name of the HCatLoader() class is different in the Hortonworks distribution.
The following worked for me:
USING org.apache.hive.hcatalog.pig.HCatLoader()
instead of
USING org.apache.hcatalog.pig.HCatLoader();
As you started to see, the issue you have is with the hive-site.xml file: you need to place it on the classpath.
As mentioned here:
A workflow action interacting with HCatalog requires the following jars in the classpath: hcatalog-core.jar, webhcat-java-client.jar, hive-common.jar, hive-exec.jar, hive-metastore.jar, hive-serde.jar and libfb303.jar. hive-site.xml, which has the configuration to talk to the HCatalog server, also needs to be in the classpath. The correct version of the HCatalog and hive jars should be placed in the classpath based on the version of HCatalog installed on the cluster.
The jars can be added to the classpath of the action using one of the below ways.
You can place the jars and hive-site.xml in the system shared library. The shared library for a pig, hive or java action can be overridden to include hcatalog shared libraries along with the action's shared library. Refer to Shared Libraries for more information. The oozie-sharelib-[version].tar.gz in the oozie distribution bundles the required HCatalog jars in a hcatalog sharelib. If using a different version of HCatalog than the one bundled in the sharelib, copy the required HCatalog jars from that version into the sharelib.
You can place the jars and hive-site.xml in the workflow application lib/ path.
You can specify the location of the jar files in the archive tag and the hive-site.xml in the file tag of the corresponding pig, hive or java action.
If you are going to use an Oozie coordinator, upload them to the HDFS coordinator path.
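Outside Oozie, a quick way to test this from the command line is to put hive-site.xml and the HCatalog jars on Pig's classpath before starting the grunt shell. A sketch, assuming CDH-style default locations (verify the paths against your layout):
$ export HCAT_HOME=/usr/lib/hive-hcatalog            # assumed CDH default location
$ export PIG_CLASSPATH=/etc/hive/conf:$HCAT_HOME/share/hcatalog/*
$ pig -useHCatalog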

Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available

When I want to start the HBase shell, I get this error:
[main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
like this:
root@SE ~ # ./hbase/bin/hbase shell
2015-02-15 20:17:51,925 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.10-hadoop2, rb18bc4b06f3eb90f592c906e78fb6461548ae627, Sun Feb 1 05:48:33 UTC 2015
hbase(main):001:0>
How should I fix this error?
This is a known issue in HBase. It has been tracked here, and the status shows it is fixed in newer versions.
HBase operation is not affected. There is even an HBase tutorial that treats these warnings as normal and includes them in the expected outputs:
HBase Basics.

PIG setup throwing error

I was trying to install Pig v0.13.0 on my Fedora 20 system. After extracting the tar.gz contents, I set up the PATH for JAVA_HOME and PIG/bin. Then I typed the command pig in the console, and this is what I got. I am unable to understand what went wrong:
[root@localhost /]# pig
14/12/21 00:05:15 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/12/21 00:05:15 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/12/21 00:05:15 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-12-21 00:05:16,082 [main] INFO org.apache.pig.Main - Apache Pig version 0.13.0 (r1606446) compiled Jun 29 2014, 02:27:58
2014-12-21 00:05:16,083 [main] INFO org.apache.pig.Main - Logging error messages to: //pig_1419100516081.log
2014-12-21 00:05:16,130 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-12-21 00:05:16,765 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-12-21 00:05:16,771 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-12-21 00:05:16,771 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020
2014-12-21 00:05:16,780 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2014-12-21 00:05:19,130 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-12-21 00:05:19,130 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
2014-12-21 00:05:19,136 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> ls
2014-12-21 00:05:33,697 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Call From localhost.localdomain/127.0.0.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Details at logfile: //pig_1419100516081.log
Please let me know why the ls command in the grunt shell threw this error.
Please guide.
When you type pig in the console, it will by default start in MAPREDUCE mode, for which you need access to a Hadoop cluster and an HDFS installation. Mapreduce mode is the default mode in Pig.
It looks like your Hadoop cluster is not configured properly; that is the reason you are getting the connection refused error. Please follow this link to solve the connection-refused problem: http://wiki.apache.org/hadoop/ConnectionRefused.
As a workaround, use LOCAL mode, which doesn't need a Hadoop installation.
In the console, type pig -x local; this will bring up the grunt shell, where you can type the ls command.
Local mode
$ pig -x local
Mapreduce mode
$ pig
or (connecting to HDFS explicitly):
$ pig -x mapreduce
OK, I got this one working. If I connect in Pig's mapreduce mode, then the ls command changes to ls hdfs:/. Hence changing the above command from ls to ls hdfs:/ resolved my problem. But again, if I connect in local mode, the plain ls command works fine.
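For reference, both variants side by side (the hdfs:/ root follows from the fs.defaultFS value shown in the log above). In mapreduce mode:
grunt> ls hdfs:/
and in local mode:
grunt> ls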

Pig error while entering simple scripts in hadoop 2 environment

I am using hadoop-2.5.1 and pig-0.13.0, and my hadoop cluster is running very well. When I try to run a simple pig script:
test = load '/input-data/data10' using PigStorage(',');
I am getting an error:
2014-11-13 15:41:19,278 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-11-13 15:41:19,279 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.addres
Please let me know if anyone has a solution.
This is not an error; it's just info logging.
If your script does not attempt to perform any action on the loaded data and doesn't throw any error, then it probably works fine.
Try adding some actions on the test variable, for example:
DUMP test;

Pig Error: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

I have just upgraded Pig from 0.12.0 to 0.13.0 on Hortonworks HDP 2.1.
I am getting the error below when I try to use XMLLoader in my script, even though I have already registered piggybank.
Script:
A = load 'EPAXMLDownload.xml' using org.apache.pig.piggybank.storage.XMLLoader('Document') as (x:chararray);
Error:
dump A
2014-08-10 23:08:56,494 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2014-08-10 23:08:56,496 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-10 23:08:56,651 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-08-10 23:08:56,727 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-08-10 23:08:57,191 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2014-08-10 23:08:57,199 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-10 23:08:57,214 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-08-10 23:08:57,223 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-08-10 23:08:57,247 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Note that Pig decides the Hadoop version depending on which context variable you have set:
HADOOP_HOME -> v1
HADOOP_PREFIX -> v2
If you use Hadoop 2, you need to recompile piggybank (which is by default compiled for Hadoop 1):
go to pig/contrib/piggybank/java
$ ant -Dhadoopversion=23
then copy that jar over pig/lib/piggybank.jar
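Putting those steps together as a shell sketch (assuming $PIG_HOME points at your Pig installation and that, as in stock Pig 0.13, the build drops piggybank.jar into the build directory):
$ cd $PIG_HOME/contrib/piggybank/java
$ ant -Dhadoopversion=23
$ cp piggybank.jar $PIG_HOME/lib/piggybank.jar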
A few more details, because the other answers didn't work for me:
Git clone the Pig git mirror: https://github.com/apache/pig
cd into the cloned directory.
If you've already built Pig in this directory in the past, you should run a clean first:
ant clean
Build Pig for Hadoop 2:
ant -Dhadoopversion=23
cd into piggybank:
cd contrib/piggybank/java
Again, if you've built piggybank before, make sure to clean out the old build files:
ant clean
Build piggybank for Hadoop 2 (same command, different directory):
ant -Dhadoopversion=23
If you don't build Pig first, piggybank will throw a bunch of "symbol not found" exceptions while compiling. In addition, since I had previously built Pig for Hadoop 1 (accidentally) without running a clean, I ran into runtime errors.
Sometimes you may get a problem like the one below after installing Pig:
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.hcatalog.common.HCatUtil.checkJobContextIfRunningFromBackend(HCatUtil.java:88)
at org.apache.hcatalog.pig.HCatLoader.setLocation(HCatLoader.java:162)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:540)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:322)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:199)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:277)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1367)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1352)
at org.apache.pig.PigServer.execute(PigServer.java:1341)
Many blogs suggest you recompile Pig by executing the command:
ant clean jar-all -Dhadoopversion=23
or recompile piggybank.jar by executing the steps below:
cd contrib/piggybank/java
ant clean
ant -Dhadoopversion=23
But this may not solve your problem. The actual cause here is related to HCatalog; try updating it! In my case, I was using Hive 0.13 and Pig 0.13, with the HCatalog provided with Hive 0.13.
Then I updated Pig to 0.15 and used the separate hive-hcatalog-0.13.0.2.1.1.0-385 library jars, and the problem was resolved.
Later I identified that it was not Pig that was creating the problem; rather, it was the Hive-HCatalog libraries. Hope this helps.
Even I faced the same error with Hadoop version 2.2.0.
The workaround is that we have to register the following jar files using the grunt shell.
The paths I am going to paste below are for the hadoop-2.2.0 version. Kindly find the jars according to your version:
/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar
/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
Using the REGISTER command, we have to register these jars along with piggybank.
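For example, from the grunt shell (the hadoop jar paths are the ones above; the piggybank.jar location is an assumption, so adjust it to your installation):
grunt> REGISTER /hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar;
grunt> REGISTER /hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar;
grunt> REGISTER /usr/lib/pig/lib/piggybank.jar;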
Run the Pig script/command now, and let me know if you face any issues.
