Which libraries are needed for MapReduce on HBase? - Maven

I have a very basic question: which libraries do I need for MapReduce on HBase?
I know I must use TableMapper, and I've got hadoop-client 2.2.0 and hbase-client 0.98.2 from Maven, but there's no TableMapper in the API.
Thank you.

In addition to the hbase-client dependency you need the hbase-server dependency of the same version; it also includes the MapReduce libraries you need.
If you're using Maven, you need to add in your pom.xml file:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.1.2</version>
</dependency>
(At the time of writing, the latest version is 1.1.2.)
Good luck!

Before HBase 2.0, you need to add the following dependency to your pom.xml:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.3.1</version>
</dependency>
After HBase 2.0, use:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-mapreduce</artifactId>
    <version>2.0.0</version>
</dependency>
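Putting it together for HBase 2.x, a dependencies section might pair the client with the MapReduce module like this (versions here are examples; keep the HBase artifacts on the same version):

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- In 2.x this module provides the mapreduce classes
         such as TableMapper and TableMapReduceUtil -->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-mapreduce</artifactId>
        <version>2.0.0</version>
    </dependency>
</dependencies>
```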

Related

Why does spark-submit fail to find kafka data source unless --packages is used?

I am trying to integrate Kafka into my Spark app. Here are the required entries in my POM file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>${spark.stream.kafka.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>${kafka.version}</version>
</dependency>
Corresponding artifact versions are:
<kafka.version>0.10.2.0</kafka.version>
<spark.stream.kafka.version>2.2.0</spark.stream.kafka.version>
I have been scratching my head over:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
I also tried supplying the jar with the --jars parameter, but it does not help. What am I missing here?
Code:
private static void startKafkaConsumerStream() {
    Dataset<HttpPackage> ds1 = _spark
            .readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", getProperty("kafka.bootstrap.servers"))
            .option("subscribe", HTTP_FED_VO_TOPIC)
            .load() // Getting the error here
            .as(Encoders.bean(HttpPackage.class));

    ds1.foreach((ForeachFunction<HttpPackage>) req -> System.out.print(req));
}
And _spark is defined as:
_spark = SparkSession
        .builder()
        .appName(_properties.getProperty("app.name"))
        .config("spark.master", _properties.getProperty("master"))
        .config("spark.es.nodes", _properties.getProperty("es.hosts"))
        .config("spark.es.port", _properties.getProperty("es.port"))
        .config("spark.es.index.auto.create", "true")
        .config("es.net.http.auth.user", _properties.getProperty("es.net.http.auth.user"))
        .config("es.net.http.auth.pass", _properties.getProperty("es.net.http.auth.pass"))
        .getOrCreate();
My imports are:
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
However, when I run my code as mentioned here, i.e. with the --packages option:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
it works.
Spark Structured Streaming supports Apache Kafka as the streaming source and sink using the external kafka-0-10-sql module.
The kafka-0-10-sql module is not available by default to Spark applications that are submitted for execution using spark-submit. The module is external, and to have it available you have to define it as a dependency.
Unless you use kafka-0-10-sql module-specific code in your Spark application, you don't have to define the module as a dependency in pom.xml. You simply don't need a compilation dependency on the module, since no code of yours uses the module's code directly. You code against interfaces, which is one of the reasons why Spark SQL is so pleasant to use (i.e. it requires very little code to build a fairly sophisticated distributed application).
spark-submit, however, will require the --packages command-line option, which you've reported worked fine.
However, when I run my code as mentioned here, i.e. with the --packages option:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
The reason it worked fine with --packages is that you have to tell the Spark infrastructure where to find the definition of the kafka format.
That leads us to the other "issue" (or rather a requirement) of running streaming Spark applications with Kafka: you have to specify the runtime dependency on the spark-sql-kafka module.
You specify a runtime dependency using the --packages command-line option (which downloads the necessary jars when you spark-submit your Spark application) or by creating a so-called uber-jar (or fat-jar).
That's where pom.xml comes into play (and that's why people offered their help with pom.xml and the module as a dependency).
So, first of all, you have to specify the dependency in pom.xml.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
And last but not least, you have to build an uber-jar, which you configure in pom.xml using the Apache Maven Shade Plugin.
With the Apache Maven Shade Plugin you create an uber-jar that includes all the "infrastructure" for the kafka format to work, inside the Spark application jar file. In fact, the uber-jar will contain all the necessary runtime dependencies, so you can spark-submit with the jar alone (and no --packages option or similar).
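A minimal shade-plugin configuration for such an uber-jar might look like this (the plugin version is an example; the ServicesResourceTransformer matters here because the kafka data source is registered through a META-INF/services file that must survive shading):

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.0</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <!-- Merges META-INF/services entries from all jars so the
                                 kafka DataSourceRegister registration is preserved -->
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```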
Add the below dependency to your pom.xml file.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
Update your dependencies and versions. The dependencies given below should work fine:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
PS: Note the provided scope of the first two dependencies.

Maven resolve dependencies issues

I want to know what the best approach is in a Maven conflict situation like this one.
I have this library:
<dependency>
    <groupId>io.vertx</groupId>
    <artifactId>vertx-core</artifactId>
    <version>3.3.3</version>
</dependency>
which depends on
netty-transport 4.1.5.Final
And I have this other dependency:
<dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-extras</artifactId>
    <version>3.1.0</version>
</dependency>
Which contains the dependency
netty-transport 4.0.37
It seems that the Cassandra version does not work with 4.1.5.Final.
So what do you normally do in these cases? If I exclude netty from the vertx dependency, it most probably won't work with the version Cassandra needs, and the other way around.
Are those libraries, with those versions, condemned to not work together?
Regards.
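The exclusion approach mentioned above can be sketched as a pom fragment like this (hypothetical; whether vertx-core 3.3.3 actually runs against the Cassandra driver's older Netty has to be tested):

```xml
<dependency>
    <groupId>io.vertx</groupId>
    <artifactId>vertx-core</artifactId>
    <version>3.3.3</version>
    <exclusions>
        <!-- Drop vertx's Netty so the version pulled in by the
             Cassandra driver wins the resolution -->
        <exclusion>
            <groupId>io.netty</groupId>
            <artifactId>netty-transport</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

Running mvn dependency:tree -Dincludes=io.netty before and after shows which Netty versions are actually resolved.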

How can I know spark-core version?

I am using Spark 1.5.2 on HDP, and the Hadoop version is 2.7.1.2.3.4.7-4. When I attempt to add jars in the Maven pom file like this:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.5.2</version>
</dependency>
I don't know where to find the version for spark-core. There are 2.11 and 2.10 variants.
Any help is appreciated.
The version you are mentioning denotes which version of Scala you want to use spark-core with.
You need to check the Scala version on your cluster to know whether you need 2.10 or 2.11.
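A common pattern is to factor the Scala binary version into a Maven property so all Spark artifacts stay consistent; a sketch (the property values are examples matching this question):

```xml
<properties>
    <scala.binary.version>2.10</scala.binary.version>
    <spark.version>1.5.2</spark.version>
</properties>

<dependencies>
    <!-- The suffix after the underscore is the Scala binary version -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
```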

Which pom dependency should I use for jar commons-lang.jar

How do I know which version of a pom dependency I should use if the version is not in the jar name? For example, for the jar commons-lang.jar, which version of the pom dependency should I use?
Here are its search results on maven central repo - http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22net.sf.staccatocommons%22%20AND%20a%3A%22commons-lang%22
First, use the one from Apache.
Second, you have two options, the 2.x or 3.x branches; from searching mvnrepository.com:
2.6
<dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.6</version>
</dependency>
3.1
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.1</version>
</dependency>
If you're using Maven, you shouldn't have "just a jar", you should only know about POM dependencies.
(As of Feb 2014 the 3.x series is up to 3.3.2, while the 2.x series is still at 2.6. Note that you may use both in the same application because of their different packages.)
While the other answers are correct, a very handy way to find an exact match for an unknown jar, when all you have is the jar itself and it does not contain a useful manifest, is to create a SHA-1 checksum of the jar and then do a checksum search on http://search.maven.org (in the Advanced Search at the bottom), or on your own instance of a Nexus repository server that has downloaded the index of the Central Repository.
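Computing the checksum is a one-liner (the jar path is an example); the printed hash goes into the "Checksum" field of the advanced search:

```shell
# SHA-1 of the jar, hash only (strip the filename column)
sha1sum commons-lang.jar | cut -d' ' -f1
```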
By the way, your search on Central was incorrect, since it had the wrong groupId in it. Here is a corrected link:
http://search.maven.org/#search%7Cga%7C1%7C%22commons-lang%22
If you are migrating to Maven and just have a bunch of jars, then you can try examining the META-INF/MANIFEST.MF files inside those jars.
I've just opened commons-lang.jar and saw the following in its META-INF/MANIFEST.MF:
...
Implementation-Title: Commons Lang
Implementation-Vendor: The Apache Software Foundation
Implementation-Vendor-Id: org.apache
Implementation-Version: 2.4
...
So you can use Implementation-Version as your version in pom.xml:
<dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.4</version>
</dependency>

Cannot find SerialAddress class in Apache Mina 2.0.2

I added the below dependency to my project POM file, and the SerialAddress class is nowhere to be found in the downloaded mina-core-2.0.2.jar.
<dependency>
    <groupId>org.apache.mina</groupId>
    <artifactId>mina-core</artifactId>
    <version>2.0.2</version>
</dependency>
The package org.apache.mina.transport.serial doesn't even exist. Please advise me on the correct dependency.
It looks like this class is not part of mina-core. Some exploration led to the existence of Apache Mina Serial Communication Support.
So I guess you would want to add the dependency for mina-transport-serial.
<dependency>
    <groupId>org.apache.mina</groupId>
    <artifactId>mina-transport-serial</artifactId>
    <version>2.0.2</version>
</dependency>