How to find jar dependencies when running Apache Pig script? - hadoop

I am having some difficulty running a simple Pig script that imports data into HBase using HBaseStorage.
The error I encounter is:
Caused by: <file demo.pig, line 14, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.backend.hadoop.hbase.HBaseStorage' with arguments '[rdf:predicate rdf:object]'
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setCacheBlocks(Z)V
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.initScan(HBaseStorage.java:427)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.<init>(HBaseStorage.java:368)
at org.apache.pig.backend.hadoop.hbase.HBaseStorage.<init>(HBaseStorage.java:239)
... 29 more
According to other questions and threads, the usual answer to this issue is to register the jars that HBaseStorage requires. What stumps me is how I am supposed to identify the required jar for a given Pig function.
I even opened the various jar files under the hbase and pig folders to make sure the classes referenced in the script were registered.
For example, since the java.lang.NoSuchMethodError was caused by org.apache.hadoop.hbase.client.Scan.setCacheBlocks(Z)V,
I specifically registered the jar that contains org.apache.hadoop.hbase.client.Scan, to no avail.
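Not something from the original post, but one generic way to answer "which jar provides this class" is to ask the class itself at runtime. Here is a minimal sketch, assuming hbase-client (or whatever jar you are chasing) is on the classpath when you run it; WhichJar is just a hypothetical helper name:

import java.security.CodeSource;

public class WhichJar {
    public static void main(String[] args) throws Exception {
        // Any class you are chasing, e.g. the one named in the stack trace.
        Class<?> clazz = Class.forName("org.apache.hadoop.hbase.client.Scan");
        CodeSource src = clazz.getProtectionDomain().getCodeSource();
        // src is null for classes loaded by the bootstrap class loader (JDK classes).
        System.out.println(src == null ? "loaded from the bootstrap classpath" : src.getLocation().toString());
    }
}

Run it with the same classpath Pig and HBase use (for example, the output of the hbase classpath command). The printed location, e.g. .../hbase-client-1.0.1.1.jar, is the artifact to REGISTER in the Pig script, or the one to inspect when a method signature doesn't match.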
Pig's documentation does not provide any obvious pointers or help that I can refer to.
I am using Hadoop 2.7.0, HBase 1.0.1.1, and Pig 0.15.0.
If you need any other clarification, feel free to ask. I would really appreciate it if someone could help me out with this issue.
Also, is it better to install Hadoop and the related software from scratch, or to get one of the available Hadoop bundles directly?

There is something wrong with the released jar hbase-client-1.0.1.1.jar.
You can test it with this code; the error will show up:
import org.apache.hadoop.hbase.client.Scan;  // from hbase-client-1.0.1.1.jar
Scan scan = new Scan();
scan.setCacheBlocks(true);  // throws java.lang.NoSuchMethodError
I've tried other setters, like setCaching, and they throw the same error, even though the methods exist when I check the source code. Maybe compile hbase-client-1.0.1.1.jar manually; I'm still looking for a better solution...
============
Update for the above: the root cause is that hbase-client-1.0.1.1.jar is incompatible with code compiled against older versions.
https://issues.apache.org/jira/browse/HBASE-10841
https://issues.apache.org/jira/browse/HBASE-10460
The return type of the setters changed (they now return the object itself instead of void), so jars compiled against the old version won't work with the current one.
For your original question, you can edit the launcher script $PIG_HOME/bin/pig and set debug=true; it will then print its launch details (such as the classpath it builds).

Did you register the required jars?
The most important jars are hbase, zookeeper, and guava.
I solved a similar issue by registering the zookeeper jar in my Pig script.

Related

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

I'm trying to run a simple Spark-to-S3 app from a server, but I keep getting the error below because the server has Hadoop 2.7.3 installed, which apparently doesn't include the GlobalStorageStatistics class. I have Hadoop 2.8.x defined in my pom.xml file, but I'm trying to test it by running it locally.
How can I make it stop looking for that class, or what workarounds are there to include it if I have to go with Hadoop 2.7.3?
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
at com.ibm.cos.jdbc2DF$.main(jdbc2DF.scala:153)
at com.ibm.cos.jdbc2DF.main(jdbc2DF.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 28 more
You can't mix bits of Hadoop and expect things to work. It's not just the close coupling between internal classes in hadoop-common and hadoop-aws; it's things like the specific version of the AWS SDK the hadoop-aws module was built with.
If you get ClassNotFoundException or MethodNotFoundException stack traces when trying to work with s3a:// URLs, JAR version mismatch is the likely cause.
Using the RFC 2119 MUST/SHOULD/MAY terminology, here are the rules to avoid this situation:
The s3a connector is in the hadoop-aws JAR; it depends on hadoop-common and the shaded AWS SDK JAR.
All these JARs MUST be on the classpath.
All hadoop-* JARs on your classpath MUST be exactly the same version, e.g. 3.3.1 everywhere, or 3.2.2 everywhere. Otherwise: stack trace. Always.
And they MUST be exclusively of that version; there MUST NOT be multiple versions of hadoop-common, hadoop-aws, etc. on the classpath. Otherwise: stack trace. Always. Usually a ClassNotFoundException indicating a mismatch between hadoop-common and hadoop-aws.
The exact missing class varies across Hadoop releases: it's the first class depended on by org.apache.hadoop.fs.s3a.S3AFileSystem which the classloader can't find; the exact class depends on the mismatch of JARs.
The AWS SDK jar SHOULD be the huge aws-java-sdk-bundle JAR, unless you know exactly which bits of the AWS SDK stack you need and are confident all transitive dependencies (jackson, httpclient, ...) are in your Spark distribution and compatible. Otherwise: missing classes or odd runtime issues.
There MUST NOT be any other AWS SDK jars on your classpath. Otherwise: duplicate classes and general classpath problems.
The AWS SDK version SHOULD be the one shipped. Otherwise: maybe stack trace, maybe not. Either way, you are in self-support mode or have opted to join a QE team for version testing.
The specific version of the AWS SDK you need can be determined from the Maven Repository.
Changing the AWS SDK versions MAY work. You get to test, and if there are compatibility problems: you get to fix. See Qualifying an AWS SDK Update for the least you should be doing.
You SHOULD use the most recent version of Hadoop you can, or the one your Spark build is tested with. Non-critical bug fixes do not get backported to old Hadoop releases, and the S3A and ABFS connectors are rapidly evolving. New releases will be better, stronger, faster. Generally.
If none of this works, a bug report filed on the ASF JIRA server will get closed as WORKSFORME. Configuration issues aren't treated as code bugs.
Finally: the ASF documentation: The S3A Connector.
Note: that link is to the latest release. If you are using an older release it will lack features. Upgrade before complaining that the s3a connector doesn't do what the documentation says it does.
I found stevel's answer above to be extremely helpful. His information inspired my write-up here. I will copy the relevant parts below. My answer is tailored to a Python/Windows context, but I suspect most points are still relevant in a JVM/Linux context.
Dependencies
This answer is intended for Python developers, so it assumes we will install Apache Spark indirectly via pip. When pip installs PySpark, it collects most dependencies automatically, as seen in .venv/Lib/site-packages/pyspark/jars. However, to enable the S3A Connector, we must track down the following dependencies manually:
JAR file: hadoop-aws
JAR file: aws-java-sdk-bundle
Executable: winutils.exe (and hadoop.dll) <-- only needed on Windows
Constraints
Assuming we're installing Spark via pip, we can't pick the Hadoop version directly. We can only pick the PySpark version, e.g. pip install pyspark==3.1.3, which will indirectly determine the Hadoop version. For example, PySpark 3.1.3 maps to Hadoop 3.2.0.
All Hadoop JARs must have the exact same version, e.g. 3.2.0. Verify this with cd pyspark/jars && ls -l | grep hadoop. Notice that pip install pyspark automatically included some Hadoop JARs. Thus, if these Hadoop JARs are 3.2.0, then we should download hadoop-aws:3.2.0 to match.
winutils.exe must have the exact same version as Hadoop, e.g. 3.2.0. Beware, winutils releases are scarce. Thus, we must carefully pick our PySpark/Hadoop version such that a matching winutils version exists. Some PySpark/Hadoop versions do not have a corresponding winutils release, thus they cannot be used on Windows.
aws-java-sdk-bundle must be compatible with our hadoop-aws choice above. For example, hadoop-aws:3.2.0 depends on aws-java-sdk-bundle:1.11.375, which can be verified here.
Instructions
With the above constraints in mind, here is a reliable algorithm for installing PySpark with S3A support on Windows:
Find latest available version of winutils.exe here. At time of writing, it is 3.2.0. Place it at C:/hadoop/bin. Set environment variable HADOOP_HOME to C:/hadoop and (important!) add %HADOOP_HOME%/bin to PATH.
Find latest available version of PySpark that uses Hadoop version equal to above, e.g. 3.2.0. This can be determined by browsing PySpark's pom.xml file across each release tag. At time of writing, it is 3.1.3.
Find the version of aws-java-sdk-bundle that hadoop-aws requires. For example, if we're using hadoop-aws:3.2.0, then we can use this page. At time of writing, it is 1.11.375.
Create a venv and install the PySpark version from step 2.
python -m venv .venv
source .venv/Scripts/activate
pip install pyspark==3.1.3
Download the AWS JARs into PySpark's JAR directory:
cd .venv/Lib/site-packages/pyspark/jars
ls -l | grep hadoop
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
Download winutils:
cd C:/hadoop/bin
curl -O https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.2.0/bin/winutils.exe
curl -O https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.2.0/bin/hadoop.dll
Testing
To verify your setup, try running the following script.
import pyspark
spark = (pyspark.sql.SparkSession.builder
    .appName('my_app')
    .master('local[*]')
    .config('spark.hadoop.fs.s3a.access.key', 'secret')
    .config('spark.hadoop.fs.s3a.secret.key', 'secret')
    .getOrCreate())
# Test reading from S3.
df = spark.read.csv('s3a://my-bucket/path/to/input/file.csv')
print(df.head(3))
# Test writing to S3.
df.write.csv('s3a://my-bucket/path/to/output')
You'll need to substitute your AWS keys and S3 paths accordingly.
If you recently updated your OS environment variables, e.g. HADOOP_HOME and PATH, you might need to close and re-open VSCode to reflect that.

Conflicting jars while using Unirest on CDH

I'm trying to use Unirest to send a POST request from a MapReduce job on a Cloudera Hadoop 5.2.1 cluster.
One of Unirest's dependencies is httpcore-4.3.3.jar. The CDH package includes httpcore-4.2.5.jar in the classpath. While trying to run my code, I got a "ClassNotFound" exception.
I added a line to my code to check where it was loading the class from, and the answer was troubling: /opt/cloudera/parcels/CDH/jars/httpcore-4.2.5.jar.
I've looked everywhere online and tried everything I found. Needless to say, nothing seems to work.
I tried setting the HADOOP_CLASSPATH environment variable, setting HADOOP_USER_CLASSPATH_FIRST, and using the -libjars parameter of the hadoop jar command (a sketch of that combination is shown after this question).
Anyone have any idea how to solve this?
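For reference, here is a hedged sketch of what the classpath-precedence attempt described above typically looks like in a MapReduce driver. The class name is hypothetical, and the mapreduce.job.user.classpath.first property is an assumption that should be verified against your Hadoop/CDH version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class UnirestJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Ask the framework to put the jars shipped with the job (e.g. via -libjars)
        // ahead of the cluster's bundled jars on the task classpath.
        conf.setBoolean("mapreduce.job.user.classpath.first", true);
        Job job = Job.getInstance(conf, "unirest-post-job");
        job.setJarByClass(UnirestJobDriver.class);
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options such as -libjars before run() is called.
        System.exit(ToolRunner.run(new UnirestJobDriver(), args));
    }
}

It would then be launched with something like hadoop jar my-job.jar UnirestJobDriver -libjars /path/to/httpcore-4.3.3.jar <input> <output>, so the newer httpcore travels with the job.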

Hive doesn't recognize jar

I started working with Hive just recently, so I may be a little new to this. I compiled a jar using a Maven build, and for some reason, when I try to add it in Hive, it won't work. I get the following error:
Query returned non-zero code: 1, cause: ex-0.0.0.1-SNAPSHOT.jar does not exist.
I uploaded the file using hue, and I can find it if I do dfs -ls in hive.
What am I missing? (I was able to load a jar I got online)
Thanks!
If you can find your jar by -lsing to it and it was properly built, this error is usually caused by incorrectly putting quotes around the path to the jar.
Incorrect:
add jar '/root/complete/path/to/jar.jar';
Correct:
add jar /root/complete/path/to/jar.jar;

How to build and execute examples in Mahout in Action

I am learning Mahout in Action now and am writing to ask how to build and execute the examples in the book. I can find instructions for Eclipse, but my environment doesn't include a UI. So I copied the first example (RecommenderIntro) to RecommenderIntro.java and compiled it with javac.
I got an error because the required packages were not found. So I am looking for:
1. Approaches to resolve the missing packages.
2. Even if it compiles successfully and a .class file is generated, how can I execute it? Through "java RecommenderIntro"? I can execute the Mahout examples through sudo -u hdfs hadoop jar mahout-examples-0.7-cdh4.2.0-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job; how can I do something similar for my own example?
3. All my data is saved in HBase tables, but in the book (and even on Google) I cannot find a way to integrate it with HBase. Any suggestions?
For q1 and q2, you need a Java build tool like Maven.
You build the Hadoop jar with 'mvn clean install'. This creates your Hadoop job jar in target/mia-job.jar.
You then execute your job with:
hadoop jar target/mia-job.jar RecommenderIntro inputDirIgnored outputDirIgnored
(RecommenderIntro ignores its parameters, but hadoop forces you to specify at least two, usually the input and output directories.)
q3: You can't, out of the box.
Option 1: export your HBase data to a text file 'intro.csv' with content like "%userId%, %ItemId%, %score%", as described in the book, because that's the file RecommenderIntro is looking for (a sketch of that example follows at the end of this answer).
Option 2: modify the example code to read data from HBase...
ps1. For developing such an application I'd really advise using an IDE, because it gives you code completion, build and execute support, etc. A simple way to get started is to download a virtual image with Hadoop, like Cloudera's or Hortonworks', and install an IDE like Eclipse. You can also configure these images to use your Hadoop cluster, but you don't need to for small data sets.
ps2. The RecommenderIntro code isn't a distributed implementation and thus can't run on large datasets. It also runs locally instead of on a Hadoop cluster.
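To make Option 1 above concrete, here is roughly what the book's first example does with intro.csv, sketched from Mahout's Taste API rather than copied from the book, so treat the exact listing as an approximation:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderIntro {
    public static void main(String[] args) throws Exception {
        // intro.csv holds one "userID,itemID,preference" triple per line.
        DataModel model = new FileDataModel(new File("intro.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}

Exporting the HBase rows into that userID,itemID,preference layout is all Option 1 requires; the recommender itself never touches HBase.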

Hadoop and PiggyBank incompatibility

I am trying to use org.apache.pig.piggybank.storage.MultiStorage from the piggybank.jar archive. I downloaded the Pig trunk and built piggybank.jar by following the instructions here. However, I get the error below when I use the MultiStorage class.
Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Looking here, it appears there is a version incompatibility between the piggybank build and the Hadoop version, but I am not able to fix this issue. I would really appreciate any help on this (I've spent an inordinate amount of time on it already).
pig.hadoop.version: 2.0.0-cdh4.1.0
> hadoop version
Hadoop 2.0.0-cdh4.1.0 Subversion
file:///data/1/jenkins/workspace/generic-package-ubuntu64-10-04/CDH4.1.0-Packaging-Hadoop-2012-09-29_10-56-25/hadoop-2.0.0+541-1.cdh4.1.0.p0.27~lucid/src/hadoop-common-project/hadoop-common -r 5c0a0bddbc2aaff30a8624b5980cd4a2e1b68d18 Compiled by jenkins on Sat Sep 29 11:26:31 PDT 2012 From source with checksum
95f5c7f30b4030f1f327758e7b2bd61f
Though I am not able to figure out how to build a compatible piggybank.jar, I found that a compatible piggybank.jar is located under /usr/lib/pig/.
I faced a similar issue when I used piggybank version 0.13 with Hadoop version Hadoop 2.4.0.2.1.5.0-695. It however worked when I used the piggybank jar in the location you mentioned -- /usr/lib/pig.
The additional observation I made is that the piggybank jar in /usr/lib/pig is quite old and does not have XPath and other newer functions available. I believe the new piggybank jar depends on a later Hadoop version.
