Apache Ignite: What are the dependencies of IgniteHadoopIgfsSecondaryFileSystem? - hadoop

I am trying to set up IGFS with Hadoop as the secondary storage. I have set my configuration as shown here, but I keep getting NoClassDefFoundErrors. I have downloaded both binary distributions of Ignite and have also tried building from source, but the dependencies are not included. hadoop-common-2.6.0.jar and ignite-hadoop-1.4.0.jar provided some of the dependencies, but now I am getting a NoClassDefFoundError for org/apache/hadoop/mapred/JobConf, which by my understanding is a deprecated class...
I have been following the instructions on the Apache Ignite website but this is as far as I've gotten.
What dependencies do I need to use IgniteHadoopIgfsSecondaryFileSystem as the secondary storage?

It looks like the problem is that the Ignite node does not have the Hadoop libraries on its classpath. To fix that, please try the following:
1) Use the "Hadoop Accelerator" edition of the Ignite distribution (use -Dignite.edition=hadoop if you're building the distribution yourself).
2) Set the HADOOP_HOME environment variable for the Ignite process if you're using the Apache Hadoop distribution, or, if you use another distribution (HDP, Cloudera, BigTop, etc.), make sure the /etc/default/hadoop file exists and has appropriate contents.
Alternatively, you can manually add the necessary Hadoop dependencies to the Ignite node classpath: these are the dependencies of groupId "org.apache.hadoop" listed in the file modules/hadoop/pom.xml (one way to pull them into a build is sketched after this list). Currently they are:
hadoop-annotations
hadoop-auth
hadoop-common
hadoop-hdfs
hadoop-mapreduce-client-common
hadoop-mapreduce-client-core
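For reference, if you manage the classpath of your own application with a build tool, a minimal sbt sketch along these lines can pull in the same artifacts; the 2.6.0 version is only an example taken from the question, so match it to the Hadoop version you actually run:

// build.sbt (sketch): the Hadoop artifacts Ignite's hadoop module depends on
val hadoopVersion = "2.6.0"  // example only; use your cluster's Hadoop version
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-annotations"             % hadoopVersion,
  "org.apache.hadoop" % "hadoop-auth"                    % hadoopVersion,
  "org.apache.hadoop" % "hadoop-common"                  % hadoopVersion,
  "org.apache.hadoop" % "hadoop-hdfs"                    % hadoopVersion,
  "org.apache.hadoop" % "hadoop-mapreduce-client-common" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-mapreduce-client-core"   % hadoopVersion
)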

If you don't want to deal with dependency management yourself - which is genuinely hard to do by hand - I'd suggest you look at projects providing orchestration and deployment services for software stacks. Check Apache Bigtop (bigtop.apache.org), which provides pre-built Linux packages for Apache Ignite, Hadoop, HDFS and pretty much anything else in this space. You can grab the latest nightly packages from our CI at http://ci.bigtop.apache.org/view/Packages/job/Bigtop-trunk-packages

Related

Setting up GeoServer on GeoMesa HBase on AWS S3

I am running GeoMesa HBase on AWS S3. I am able to ingest / export data from inside the cluster with geomesa-hbase ingest / export, but I am trying to access the data remotely. I have installed GeoServer (on the same master node where GeoMesa is running, if that is relevant), but I am having difficulty providing GeoServer the correct JARs to access GeoMesa. I can find the list of JARs that I should provide to GeoServer here, but I am not sure how or where to collect them. I have tried using the install-hadoop.sh and install-hbase.sh shell scripts in the /opt/geomesa/bin folder to install the HBase, Hadoop and Zookeeper JARs into GeoServer's WEB-INF/lib folder, but if I change the Hadoop, Zookeeper and HBase versions in these shell scripts to match the versions running on my cluster, it does not find any JARs.
I am running everything on an EMR 6.2.0 release (which comes with Hadoop 3.2.1, HBase 2.2.6 and Zookeeper 3.4.14). On top of the cluster I am running GeoMesa 3.0.0-m0 with GeoServer 2.17, but I have also tried GeoMesa 2.4.0 with GeoServer 2.15. I'm fine with switching either the EMR release version or the GeoMesa/GeoServer versions if that makes things easier.
For posterity, the setup that worked was:
GeoMesa 3.1.1
GeoServer 2.17.3
Extract the geomesa-hbase-gs-plugin into GeoServer's WEB-INF/lib directory
Run install-dependencies.sh (without modification) from the GeoMesa binary distribution to copy jars into GeoServer's WEB-INF/lib directory
Copy the hbase-site.xml into GeoServer's WEB-INF/classes directory

Running Spark from a local IDE

I've been spending some time banging my head over trying to run a complex Spark application locally in order to test more quickly (without having to package and deploy it to a cluster).
Some context:
This Spark application interfaces with the DataStax Enterprise version of Cassandra and its distributed file system, so it needs some explicit jars to be provided (not available in Maven)
These jars are available on my local machine, and to "cheese" this, I tried placing them in SPARK_HOME/jars so they would be automatically added to the classpath
I tried to do something similar with the required configuration settings by putting them in spark-defaults.conf under SPARK_HOME/conf
When building this application, we do not build an uber jar, but rather do a spark-submit on the server using --jars
The problem I'm facing is that when I run the Spark application through my IDE, it doesn't seem to pick up any of these additional items from the SPARK_HOME directory (config or jars). I spent a few hours trying to get the config items to work and ended up setting them as System properties in my test case before starting the Spark session so that Spark would pick them up, so the configuration settings can be ignored.
However, I do not know how to reproduce this for the vendor-specific jar files. Is there an easy way I can emulate the --jars behavior of spark-submit and somehow set up my Spark session with these jars? Note: I am using the following command in my code to start a Spark session:
SparkSession.builder().config(conf).getOrCreate()
Additional information, in case it helps:
The Spark version I have locally in SPARK_HOME is the same version that my code is compiling with using Maven.
I asked another question similar to this related to configs: Loading Spark Config for testing Spark Applications
When I print the SPARK_HOME environment variable in my application, I get the correct SPARK_HOME value, so I'm not sure why neither the configs nor the jar files are being picked up from there. Is it possible that when running the application from my IDE, it's not picking up the SPARK_HOME environment variable and is using all defaults?
You can make use of .config(key, value) while building the SparkSession by passing "spark.jars" as the key and a comma-separated list of paths to the jars, like so:
SparkSession.builder().config("spark.jars", "/path/jar1.jar, /path/jar2.jar").config(conf).getOrCreate()

Hadoop confs for client application

I have a client application that uses the Hadoop conf files (hadoop-site.xml and hadoop-core.xml).
I don't want to check them into the resources folder, so I tried to add them via IDEA.
The problem is that the Hadoop Configuration ignores my HADOOP_CONF_DIR and loads the default confs from the Hadoop package. Any idea?
I'm using Gradle.
I ended up solving it by putting the configuration files in the test resources folder, so they are not included when the jar gets built.
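As a sketch of why that works: Hadoop's Configuration resolves resources from the classpath, so files under the test resources folder are visible to tests but are not packaged into the production jar. Assuming the standard Configuration API (the resource names below are the ones from the question), something like this picks them up:

import org.apache.hadoop.conf.Configuration

object ConfCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // load the client's conf files from the classpath (here, the test resources folder)
    conf.addResource("hadoop-core.xml")
    conf.addResource("hadoop-site.xml")
    // print a property to confirm the custom confs were loaded instead of the packaged defaults
    println(conf.get("fs.defaultFS"))
  }
}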

Where is Spark loading its jars from?

When specifying a jar via "spark.jars" and running on standalone Spark without spark-submit, where is the jar loaded from?
I have a Spring application that performs some spark operations on a Spark standalone running in Docker.
My application relies on various libraries such as the MySQL JDBC driver, Elasticsearch, etc., and thus it fails when running on the cluster, which doesn't have them.
I assembled my jar with all its dependencies and moved it to the /jars directory in Docker, but still no luck.
13:28:42.577 [Executor task launch worker-0] INFO org.apache.spark.executor.Executor - Fetching spark://192.168.99.1:58290/jars/xdf-1.0.jar with timestamp 1499088505128
13:28:42.614 [dispatcher-event-loop-0] INFO org.apache.spark.executor.Executor - Executor is trying to kill task 0.3 in stage 1.0 (TID 7)
13:28:42.698 [Executor task launch worker-0] DEBUG org.apache.spark.network.client.TransportClient - Sending stream request for /jars/xdf-1.0.jar to /192.168.99.1:58290
13:28:42.741 [shuffle-client-7-1] DEBUG org.apache.spark.rpc.netty.NettyRpcEnv - Error downloading stream /jars/xdf-1.0.jar.
java.lang.RuntimeException: Stream '/jars/xdf-1.0.jar' was not found.
Now I noticed that it's looking for the jar on the driver host, but I don't understand where it's trying to deploy it from.
Does anyone have an idea where it's looking for that jar?
I figured it out. The jars are loaded from the driver node, so I didn't need to move my jar to the Spark nodes. I just had to set the correct path to the dependency jar.
So this solved it:
spark.jars=./target/scala-2.1.1/xdf.jar
If you're essentially running a standalone application in local mode, you will need to provide all jars on your own, as opposed to having spark-submit stage the Spark runtime for you. Assuming you're using a build system such as Maven or Gradle, you will need to package all transitive dependencies with your application and remove any "provided" scope declarations.
The easiest approach in this case is to use an assembly plugin (e.g. sbt-assembly) or the maven-shade plugin to package a fat jar and then run that; a sketch follows below.
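As an illustration only (the plugin version and merge rules are assumptions, not taken from the question), an sbt-assembly setup for such a fat jar might look like:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")  // use a current release

// build.sbt -- note: do not mark Spark as "provided" if the fat jar must run in local mode
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard  // drop signature files etc.
  case _                             => MergeStrategy.first
}

Running sbt assembly then produces a single runnable jar under the target/ directory.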
If you're running in cluster mode, you can programmatically submit your application using SparkLauncher; here's an example in Scala:
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  val spark = new SparkLauncher()
    .setSparkHome("/home/user/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/user/example-assembly-1.0.jar")
    .setMainClass("MySparkApp")
    .setMaster("local[*]")
    .launch();
  spark.waitFor();
}
Keep in mind that in YARN mode, you will also have to provide the path to your Hadoop configuration.
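One way to do that (a sketch, assuming the standard SparkLauncher API; all paths are placeholders) is to pass HADOOP_CONF_DIR as an environment variable to the launcher:

import java.util.Collections
import org.apache.spark.launcher.SparkLauncher

object YarnLauncher extends App {
  // point the launched spark-submit at the cluster's Hadoop/YARN configuration
  val env = Collections.singletonMap("HADOOP_CONF_DIR", "/etc/hadoop/conf")
  val process = new SparkLauncher(env)
    .setSparkHome("/home/user/spark")
    .setAppResource("/home/user/example-assembly-1.0.jar")
    .setMainClass("MySparkApp")
    .setMaster("yarn")
    .setDeployMode("cluster")
    .launch();
  process.waitFor();
}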

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics

I'm trying to run a simple Spark-to-S3 app from a server, but I keep getting the error below because the server has Hadoop 2.7.3 installed, and it looks like that version doesn't include the StorageStatistics class. I have Hadoop 2.8.x defined in my pom.xml file, but I'm trying to test it by running it locally.
How can I make it ignore searching for that class, or what workaround options are there to include it if I have to stay with Hadoop 2.7.3?
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
at com.ibm.cos.jdbc2DF$.main(jdbc2DF.scala:153)
at com.ibm.cos.jdbc2DF.main(jdbc2DF.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 28 more
You can't mix bits of Hadoop and expect things to work. It's not just the close coupling between internal classes in hadoop-common and hadoop-aws, it's things like the specific version of the Amazon AWS SDK that the hadoop-aws module was built with.
If you get ClassNotFoundException or MethodNotFoundException stack traces when trying to work with s3a:// URLs, JAR version mismatch is the likely cause.
Using the RFC 2119 MUST/SHOULD/MAY terminology, here are the rules to avoid this situation:
The s3a connector is in the hadoop-aws JAR; it depends on hadoop-common and the shaded AWS SDK JAR.
All these JARs MUST be on the classpath.
All versions of the hadoop-* JARs on your classpath MUST be exactly the same version, e.g. 3.3.1 everywhere, or 3.2.2. Otherwise: stack trace. Always. (See the build sketch at the end of this answer for one way to pin them.)
And they MUST be exclusively of that version; there MUST NOT be multiple versions of hadoop-common, hadoop-aws etc. on the classpath. Otherwise: stack trace. Always. Usually a ClassNotFoundException indicating a mismatch between hadoop-common and hadoop-aws.
The exact missing class varies across Hadoop releases: it's the first class depended on by org.apache.hadoop.fs.s3a.S3AFileSystem which the classloader can't find; the exact class depends on the mismatch of JARs.
The AWS SDK jar SHOULD be the huge aws-java-sdk-bundle JAR, unless you know exactly which bits of the AWS SDK stack you need and are confident all transitive dependencies (jackson, httpclient, ...) are in your Spark distribution and compatible. Otherwise: missing classes or odd runtime issues.
There MUST NOT be any other AWS SDK jars on your classpath. Otherwise: duplicate classes and general classpath problems.
The AWS SDK version SHOULD be the one shipped with your Hadoop release. Otherwise: maybe a stack trace, maybe not. Either way, you are in self-support mode or have opted to join a QE team for version testing.
The specific version of the AWS SDK you need can be determined from Maven Repository
Changing the AWS SDK versions MAY work. You get to test, and if there are compatibility problems: you get to fix. See Qualifying an AWS SDK Update for the least you should be doing.
You SHOULD use the most recent versions of Hadoop you can, or whatever Spark is tested with. Non-critical bug fixes do not get backported to old Hadoop releases, and the S3A and ABFS connectors are rapidly evolving. Generally, new releases will be better, stronger, faster.
If none of this works, a bug report filed on the ASF JIRA server will get closed as WORKSFORME; configuration issues aren't treated as code bugs.
Finally: the ASF documentation: The S3A Connector.
Note: that link is to the latest release. If you are using an older release it will lack features. Upgrade before complaining that the s3a connector doesn't do what the documentation says it does.
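To make the version-alignment rules above concrete, here is a build sketch in sbt syntax; the 3.2.0 / 1.11.375 pairing is only an example (it is the pairing for hadoop-aws:3.2.0 noted in the answer below), so substitute whatever matches your own Spark/Hadoop distribution:

// build.sbt (sketch): pin every hadoop-* artifact to one version and use the matching SDK bundle
val hadoopVersion = "3.2.0"
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-common" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-aws"    % hadoopVersion,
  // the shaded bundle this hadoop-aws release was built against; keep other AWS SDK jars off the classpath
  "com.amazonaws"     % "aws-java-sdk-bundle" % "1.11.375"
)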
I found stevel's answer above to be extremely helpful. His information inspired my write-up here. I will copy the relevant parts below. My answer is tailored to a Python/Windows context, but I suspect most points are still relevant in a JVM/Linux context.
Dependencies
This answer is intended for Python developers, so it assumes we will install Apache Spark indirectly via pip. When pip installs PySpark, it collects most dependencies automatically, as seen in .venv/Lib/site-packages/pyspark/jars. However, to enable the S3A Connector, we must track down the following dependencies manually:
JAR file: hadoop-aws
JAR file: aws-java-sdk-bundle
Executable: winutils.exe (and hadoop.dll) <-- Only needed in Windows
Constraints
Assuming we're installing Spark via pip, we can't pick the Hadoop version directly. We can only pick the PySpark version, e.g. pip install pyspark==3.1.3, which will indirectly determine the Hadoop version. For example, PySpark 3.1.3 maps to Hadoop 3.2.0.
All Hadoop JARs must have the exact same version, e.g. 3.2.0. Verify this with cd pyspark/jars && ls -l | grep hadoop. Notice that pip install pyspark automatically included some Hadoop JARs. Thus, if these Hadoop JARs are 3.2.0, then we should download hadoop-aws:3.2.0 to match.
winutils.exe must have the exact same version as Hadoop, e.g. 3.2.0. Beware, winutils releases are scarce. Thus, we must carefully pick our PySpark/Hadoop version such that a matching winutils version exists. Some PySpark/Hadoop versions do not have a corresponding winutils release, thus they cannot be used on Windows.
aws-java-sdk-bundle must be compatible with our hadoop-aws choice above. For example, hadoop-aws:3.2.0 depends on aws-java-sdk-bundle:1.11.375, which can be verified here.
Instructions
With the above constraints in mind, here is a reliable algorithm for installing PySpark with S3A support on Windows:
Find latest available version of winutils.exe here. At time of writing, it is 3.2.0. Place it at C:/hadoop/bin. Set environment variable HADOOP_HOME to C:/hadoop and (important!) add %HADOOP_HOME%/bin to PATH.
Find latest available version of PySpark that uses Hadoop version equal to above, e.g. 3.2.0. This can be determined by browsing PySpark's pom.xml file across each release tag. At time of writing, it is 3.1.3.
Find the version of aws-java-sdk-bundle that hadoop-aws requires. For example, if we're using hadoop-aws:3.2.0, then we can use this page. At time of writing, it is 1.11.375.
Create a venv and install the PySpark version from step 2.
python -m venv .venv
source .venv/Scripts/activate
pip install pyspark==3.1.3
Download the AWS JARs into PySpark's JAR directory:
cd .venv/Lib/site-packages/pyspark/jars
ls -l | grep hadoop
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar
Download winutils:
cd C:/hadoop/bin
curl -O https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.2.0/bin/winutils.exe
curl -O https://raw.githubusercontent.com/cdarlint/winutils/master/hadoop-3.2.0/bin/hadoop.dll
Testing
To verify your setup, try running the following script.
import pyspark
spark = (pyspark.sql.SparkSession.builder
    .appName('my_app')
    .master('local[*]')
    .config('spark.hadoop.fs.s3a.access.key', 'secret')
    .config('spark.hadoop.fs.s3a.secret.key', 'secret')
    .getOrCreate())
# Test reading from S3.
df = spark.read.csv('s3a://my-bucket/path/to/input/file.csv')
print(df.head(3))
# Test writing to S3.
df.write.csv('s3a://my-bucket/path/to/output')
You'll need to substitute your AWS keys and S3 paths accordingly.
If you recently updated your OS environment variables, e.g. HADOOP_HOME and PATH, you might need to close and re-open VSCode to reflect that.
