Hadoop and PiggyBank incompatibility

I am trying to use org.apache.pig.piggybank.storage.MultiStorage from the piggybank.jar archive. I downloaded the Pig trunk and built piggybank.jar by following the instructions here. However, I get the error below when I use the MultiStorage class.
Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Looking here, it seems there is a version incompatibility between the piggybank build and the Hadoop version, but I have not been able to fix the issue. I would really appreciate any help on this (I have spent an inordinate amount of time on it already).
pig.hadoop.version: 2.0.0-cdh4.1.0
> hadoop version
Hadoop 2.0.0-cdh4.1.0
Subversion file:///data/1/jenkins/workspace/generic-package-ubuntu64-10-04/CDH4.1.0-Packaging-Hadoop-2012-09-29_10-56-25/hadoop-2.0.0+541-1.cdh4.1.0.p0.27~lucid/src/hadoop-common-project/hadoop-common -r 5c0a0bddbc2aaff30a8624b5980cd4a2e1b68d18
Compiled by jenkins on Sat Sep 29 11:26:31 PDT 2012
From source with checksum 95f5c7f30b4030f1f327758e7b2bd61f

Though I was not able to figure out how to build a compatible piggybank.jar myself, I found that a compatible piggybank.jar is already located under /usr/lib/pig/.

I faced a similar issue when using piggybank 0.13 with Hadoop 2.4.0.2.1.5.0-695. It worked, however, when I used the piggybank jar from the location you mentioned -- /usr/lib/pig.
One additional observation: the piggybank jar in /usr/lib/pig is quite old and does not include XPath and some other functions. I believe the newer piggybank jar depends on a later Hadoop version.
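For context, the "Found interface ... but class was expected" error is the classic symptom of mixing the Hadoop 1.x and 2.x API lines: org.apache.hadoop.mapreduce.TaskAttemptContext was a class in Hadoop 1.x and became an interface in Hadoop 2.x, so a piggybank.jar built against the wrong line breaks at runtime. A minimal sketch of a check (not from the original posters; it assumes the cluster's Hadoop jars are on the classpath, and the class name CheckMapReduceApi is just illustrative):
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.util.VersionInfo;

public class CheckMapReduceApi {
    public static void main(String[] args) {
        // Hadoop 1.x ships TaskAttemptContext as a class; Hadoop 2.x turned it into an
        // interface, so a piggybank built against the other line fails at runtime.
        System.out.println("TaskAttemptContext is an interface: "
                + TaskAttemptContext.class.isInterface());
        System.out.println("Hadoop version on the classpath: " + VersionInfo.getVersion());
    }
}
Run it with the same classpath Pig uses; piggybank has to be built against whichever Hadoop line this reports.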

Related

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

I'm trying to run a simple Spark-to-S3 app from a server, but I keep getting the error below because the server has Hadoop 2.7.3 installed, and it looks like that version doesn't include the GlobalStorageStatistics class. I have Hadoop 2.8.x defined in my pom.xml file, but I'm trying to test the app by running it locally.
How can I make it skip looking for that class, or what workaround options are there to provide it if I have to stay on Hadoop 2.7.3?
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
at com.ibm.cos.jdbc2DF$.main(jdbc2DF.scala:153)
at com.ibm.cos.jdbc2DF.main(jdbc2DF.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 28 more
You can't mix bits of Hadoop and expect things to work. It's not just the close coupling between internal classes in hadoop-common and hadoop-aws; it's also things like the specific version of the AWS SDK that the hadoop-aws module was built with.
If you get ClassNotFoundException or NoSuchMethodError stack traces when trying to work with s3a:// URLs, a JAR version mismatch is the likely cause.
Using the RFC 2119 MUST/SHOULD/MAY terminology, here are the rules for avoiding this situation:
The s3a connector is in the hadoop-aws JAR; it depends on hadoop-common and the aws-java-sdk-bundle JAR.
All of these JARs MUST be on the classpath.
All of the hadoop-* JARs on your classpath MUST be exactly the same version, e.g. 3.3.1 everywhere, or 3.2.2 everywhere. Otherwise: stack trace. Always.
And they MUST be exclusively of that version; there MUST NOT be multiple versions of hadoop-common, hadoop-aws, etc. on the classpath. Otherwise: stack trace. Always. Usually a ClassNotFoundException indicating a mismatch between hadoop-common and hadoop-aws.
The exact missing class varies across Hadoop releases: it is the first class depended on by org.apache.hadoop.fs.s3a.S3AFileSystem that the classloader cannot find; which class that is depends on the particular mismatch of JARs.
The AWS SDK JAR SHOULD be the huge aws-java-sdk-bundle JAR, unless you know exactly which bits of the AWS SDK stack you need and are confident that all transitive dependencies (jackson, httpclient, ...) are in your Spark distribution and compatible. Otherwise: missing classes or odd runtime issues.
There MUST NOT be any other AWS SDK jars on your classpath. Otherwise: duplicate classes and general classpath problems.
The AWS SDK version SHOULD be the one shipped. Otherwise: maybe a stack trace, maybe not. Either way, you are in self-support mode, or have opted to join a QE team for version testing.
The specific version of the AWS SDK you need can be determined from the Maven Repository.
Changing the AWS SDK versions MAY work. You get to test, and if there are compatibility problems: you get to fix. See Qualifying an AWS SDK Update for the least you should be doing.
You SHOULD use the most recent version of Hadoop you can, or at least the one your Spark release is tested with. Non-critical bug fixes do not get backported to old Hadoop releases, and the S3A and ABFS connectors are evolving rapidly. New releases will be better, stronger, faster. Generally.
If none of this works, a bug report filed on the ASF JIRA server will get closed as WORKSFORME. Configuration issues are not treated as code bugs.
Finally: the ASF documentation: The S3A Connector.
Note: that link is to the latest release. If you are using an older release it will lack features. Upgrade before complaining that the s3a connector doesn't do what the documentation says it does.
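As a quick diagnostic for the rules above, here is a minimal sketch (not part of the original answer) that prints which hadoop-common actually won on the classpath, which jar it was loaded from, and whether the class missing in this question is present; the class name CheckHadoopClasspath is just illustrative:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.util.VersionInfo;

public class CheckHadoopClasspath {
    public static void main(String[] args) {
        // Version baked into whichever hadoop-common jar the classloader picked up.
        System.out.println("hadoop-common version: " + VersionInfo.getVersion());
        // Physical jar the core FileSystem class came from; if its version does not match
        // your hadoop-aws jar, you have exactly the mismatch described above.
        System.out.println("loaded from: "
                + FileSystem.class.getProtectionDomain().getCodeSource().getLocation());
        try {
            // The class the question's stack trace could not find; it was added in Hadoop 2.8.0.
            Class.forName("org.apache.hadoop.fs.StorageStatistics");
            System.out.println("StorageStatistics present (Hadoop 2.8+)");
        } catch (ClassNotFoundException e) {
            System.out.println("StorageStatistics missing (Hadoop 2.7.x or older)");
        }
    }
}
If it reports 2.7.x while your application was built against hadoop-aws 2.8.x, that is the mismatch behind the NoClassDefFoundError above.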
I found stevel's answer above to be extremely helpful. His information inspired my write-up here. I will copy the relevant parts below. My answer is tailored to a Python/Windows context, but I suspect most points are still relevant in a JVM/Linux context.
Dependencies
This answer is intended for Python developers, so it assumes we will install Apache Spark indirectly via pip. When pip installs PySpark, it collects most dependencies automatically, as seen in .venv/Lib/site-packages/pyspark/jars. However, to enable the S3A Connector, we must track down the following dependencies manually:
JAR file: hadoop-aws
JAR file: aws-java-sdk-bundle
Executable: winutils.exe (and hadoop.dll) <-- Only needed in Windows
Constraints
Assuming we're installing Spark via pip, we can't pick the Hadoop version directly. We can only pick the PySpark version, e.g. pip install pyspark==3.1.3, which will indirectly determine the Hadoop version. For example, PySpark 3.1.3 maps to Hadoop 3.2.0.
All Hadoop JARs must have the exact same version, e.g. 3.2.0. Verify this with cd pyspark/jars && ls -l | grep hadoop. Notice that pip install pyspark automatically included some Hadoop JARs. Thus, if these Hadoop JARs are 3.2.0, then we should download hadoop-aws:3.2.0 to match.
winutils.exe must have the exact same version as Hadoop, e.g. 3.2.0. Beware, winutils releases are scarce. Thus, we must carefully pick our PySpark/Hadoop version such that a matching winutils version exists. Some PySpark/Hadoop versions do not have a corresponding winutils release, thus they cannot be used on Windows.
aws-java-sdk-bundle must be compatible with our hadoop-aws choice above. For example, hadoop-aws:3.2.0 depends on aws-java-sdk-bundle:1.11.375, which can be verified here.
Instructions
With the above constraints in mind, here is a reliable algorithm for installing PySpark with S3A support on Windows:
1. Find the latest available version of winutils.exe here. At the time of writing, it is 3.2.0. Place it at C:/hadoop/bin. Set the environment variable HADOOP_HOME to C:/hadoop and (important!) add %HADOOP_HOME%/bin to PATH.
2. Find the latest available version of PySpark that uses a Hadoop version equal to the above, e.g. 3.2.0. This can be determined by browsing PySpark's pom.xml file across each release tag. At the time of writing, it is 3.1.3.
3. Find the version of aws-java-sdk-bundle that hadoop-aws requires. For example, if we're using hadoop-aws:3.2.0, then we can use this page. At the time of writing, it is 1.11.375.
4. Create a venv and install the PySpark version from step 2:
python -m venv .venv
source .venv/Scripts/activate
pip install pyspark==3.1.3
5. Download the AWS JARs into PySpark's JAR directory:
cd .venv/Lib/site-packages/pyspark/jars
ls -l | grep hadoop
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
6. Download winutils:
cd C:/hadoop/bin
curl -O https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.2.0/bin/winutils.exe
curl -O https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.2.0/bin/hadoop.dll
Testing
To verify your setup, try running the following script.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('my_app')
         .master('local[*]')
         .config('spark.hadoop.fs.s3a.access.key', 'secret')
         .config('spark.hadoop.fs.s3a.secret.key', 'secret')
         .getOrCreate())
# Test reading from S3.
df = spark.read.csv('s3a://my-bucket/path/to/input/file.csv')
print(df.head(3))
# Test writing to S3.
df.write.csv('s3a://my-bucket/path/to/output')
You'll need to substitute your AWS keys and S3 paths accordingly.
If you recently updated your OS environment variables, e.g. HADOOP_HOME and PATH, you might need to close and re-open VSCode for the changes to take effect.

Apache Ignite: What are the dependencies of IgniteHadoopIgfsSecondaryFileSystem?

I am trying to set up IGFS with Hadoop as the secondary storage. I have set my configuration as shown here, but I keep getting NoClassDefFoundErrors. I have downloaded both binary distributions of Ignite and have also tried building from source, but the dependencies are not included. hadoop-common-2.6.0.jar and ignite-hadoop-1.4.0.jar provided some of the dependencies, but now I am getting a NoClassDefFoundError for org/apache/hadoop/mapred/JobConf, which, by my understanding, is a deprecated class...
I have been following the instructions on the Apache Ignite website but this is as far as I've gotten.
What dependencies do I need for IgniteHadoopIgfsSecondaryFileSystem as the secondary storage?
It looks like the problem is that the Ignite node does not have the Hadoop libraries on its classpath. To fix that, please try the following:
1) Use the "Hadoop Accelerator" edition of the Ignite distribution (pass -Dignite.edition=hadoop if you're building the distribution yourself).
2) Set the HADOOP_HOME environment variable for the Ignite process if you're using the Apache Hadoop distribution, or, if you use another distribution (HDP, Cloudera, BigTop, etc.), make sure the /etc/default/hadoop file exists and has appropriate contents.
Alternatively, you can manually add the necessary Hadoop dependencies to the Ignite node classpath: these are the dependencies with groupId "org.apache.hadoop" listed in modules/hadoop/pom.xml. Currently they are:
hadoop-annotations
hadoop-auth
hadoop-common
hadoop-hdfs
hadoop-mapreduce-client-common
hadoop-mapreduce-client-core
If you don't want to deal with dependency management yourself, which is genuinely hard to do by hand, I'd suggest looking at projects that provide orchestration and deployment services for software stacks. Check Apache Bigtop (bigtop.apache.org), which provides pre-built Linux packages for Apache Ignite, Hadoop, HDFS, and pretty much anything else in this space. You can grab the latest nightly packages from our CI at http://ci.bigtop.apache.org/view/Packages/job/Bigtop-trunk-packages

How to find jar dependencies when running Apache Pig script?

I am having some difficulty running a simple Pig script to import data into HBase using HBaseStorage.
The error I encountered is:
Caused by: <file demo.pig, line 14, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.backend.hadoop.hbase.HBaseStorage' with arguments '[rdf:predicate rdf:object]'
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setCacheBlocks(Z)V
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.initScan(HBaseStorage.java:427)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.<init>(HBaseStorage.java:368)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.<init>(HBaseStorage.java:239)
... 29 more
According to other questions and threads, the usual answer to this issue is to register the appropriate jars required for the HBaseStorage references. What stumps me is how I am supposed to identify the required JAR for a given Pig function.
I even tried opening the various jar files under the hbase and pig folders to make sure the appropriate classes are registered in the Pig script.
For example, since the java.lang.NoSuchMethodError was caused by org.apache.hadoop.hbase.client.Scan.setCacheBlocks(Z)V,
I specifically registered the jar that contains org.apache.hadoop.hbase.client.Scan, to no avail.
Pig's documentation does not provide any obvious links or help that I can refer to.
I am using Hadoop 2.7.0, HBase 1.0.1.1, and Pig 0.15.0.
If you need any other clarification, feel free to ask. I would really appreciate it if someone could help me out with this issue.
Also, is it better to install Hadoop and the related software from scratch, or to use one of the available Hadoop bundles directly?
There is something wrong with the released jar: hbase-client-1.0.1.1.jar
You can test it with this code; the error will show up:
import org.apache.hadoop.hbase.client.Scan;

Scan scan = new Scan();
scan.setCacheBlocks(true); // throws NoSuchMethodError with the incompatible hbase-client jar
I've tried other setter functions, like setCaching, and they throw the same error. Yet when I checked the source code, those functions do exist. Maybe compiling hbase-client-1.0.1.1.jar manually would help; I'm still looking for a better solution...
============
Update to the above: I found that the root cause is an incompatibility between hbase-client-1.0.1.1.jar and older versions.
https://issues.apache.org/jira/browse/HBASE-10841
https://issues.apache.org/jira/browse/HBASE-10460
The return type of the setter methods changed, so jars compiled against the old version won't work with the current one.
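To see which Scan API your classpath actually provides, here is a small reflection sketch (assuming hbase-client is on the classpath; the class name CheckScanSetter is just illustrative) that prints the setter's return type and the jar the Scan class was loaded from:
import java.lang.reflect.Method;

import org.apache.hadoop.hbase.client.Scan;

public class CheckScanSetter {
    public static void main(String[] args) throws Exception {
        Method m = Scan.class.getMethod("setCacheBlocks", boolean.class);
        // Older hbase-client jars declare this setter as void; 1.x returns Scan (HBASE-10841),
        // so code compiled against the old void signature fails with NoSuchMethodError.
        System.out.println("setCacheBlocks returns: " + m.getReturnType().getName());
        System.out.println("Scan loaded from: "
                + Scan.class.getProtectionDomain().getCodeSource().getLocation());
    }
}
If it reports a return type of org.apache.hadoop.hbase.client.Scan while Pig's HBaseStorage was compiled against the old void signature, you get exactly the NoSuchMethodError above.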
For your other question, you can modify the Pig launcher script $PIG_HOME/bin/pig and set debug=true; it will then print its run information.
Did you register the required jars?
The most important jars are hbase, zookeeper, and guava.
I solved a similar issue by registering the zookeeper jar in my Pig script.

Eclipse plugin error for Hadoop on Ubuntu

I installed Hadoop 1.0.3 and its related Eclipse plugin successfully. All the Hadoop functionality and examples work well, but when I try to use the plugin in Eclipse, it cannot connect to HDFS and I get the error:
An internal error occurred during: "Connecting to DFS localhost".
org/apache/commons/configuration/Configuration.
Could anybody help me solve this problem?
Thanks
You are facing this problem because the plugin is missing some necessary jars. To solve it, you need to rebuild the plugin after including those jars. I have seen this kind of question a lot on SO, and they all point to the same thing. Please see these links:
Eclipse Hadoop plugin issue(Call to localhost/127.0.0.1:50070 )Can any body give me the solution for this?
Hadoop eclipse mapreduce is not working?
Installing Hadoop's Eclipse Plugin
I followed the instructions in this blog post to build the Hadoop Eclipse plugin 1.0.4:
http://iredlof.com/part-4-compile-hadoop-v1-0-4-eclipse-plugin-on-ubuntu-12-10/
but it seems to be missing some parts, for example:
in MANIFEST.MF you should add:
/lib/commons-cli-1.2.jar
and in build-contrib.xml you should also add:
<property name="commons-cli.version" value="1.2"/>
I hope these are useful!
You must start Hadoop from the command line first:
./[hadoop-path]/bin/start-all.sh

Eclipse plugins in hadoop on windows

I'm new to Hadoop. I am trying to install Hadoop on my Windows machine with the help of the following link: http://blog.v-lad.org/archives/4#comment-43
I'm using:
Eclipse IDE: 3.3.1
Java JDK: 1.6.0_24
Hadoop: 0.21.0
Everything else is fine, but in the Eclipse IDE, when I select "New Hadoop location", the action is not performed. I don't understand the problem. Can anyone help me?
Have you added the Eclipse plugin to the Eclipse installation directory? That is, copy
...\hadoop-0.21.0\mapred\contrib\eclipse-plugin\hadoop-0.21.0-eclipse-plugin.jar
to
...\eclipse\dropins
P.S. The Eclipse plugin shipped with Hadoop 0.21.0 is not complete.
You can download a revised one at http://www.lifeba.org/wp-content/uploads/2012/03/hadoop-0.21.0-eclipse-plugin-3.6.rar
Although it's built for Eclipse 3.6, I suppose it's also compatible with Eclipse 3.3.1.
——————————————————————————————
Oh, I just noticed the timestamps... this was asked years ago... sorry.
I hope it is useful for anyone who has run into this problem recently.
