Cascading HBase Tap

I am trying to write Scalding jobs which have to connect to HBase, but I have trouble using the HBase tap. I have tried using the tap provided by Twitter Maple, following this example project, but it seems that there is some incompatibility between the Hadoop/HBase version that I am using and the one that was used as client by Twitter.
My cluster is running Cloudera CDH4 with HBase 0.92 and Hadoop 2.0.0-cdh4.1.3. Whenever I launch a Scalding job connecting to HBase, I get the exception
java.lang.NoSuchMethodError: org.apache.hadoop.net.NetUtils.getInputStream(Ljava/net/Socket;)Ljava/io/InputStream;
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:363)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1046)
...
It seems that the HBase client used by Twitter Maple is expecting some method on NetUtils that does not exist on the version of Hadoop deployed on my cluster.
How do I track down exactly what the mismatch is - which Hadoop version the HBase client expects, and so on? Is there, in general, a way to mitigate these issues?
It seems to me that client libraries are often compiled against hardcoded versions of the Hadoop dependencies, and it is hard to make those match the versions actually deployed.

The method actually exists but has changed its signature. Basically, it boils down to having different versions of Hadoop libraries on your client and server. If your server is running Cloudera, you should be using the HBase and Hadoop libraries from Cloudera. If you're using Maven, you can use Cloudera's Maven repository.
It seems like library dependencies are handled in Build.scala. I haven't used Scala yet, so I'm not entirely sure how to fix it there.
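For illustration, a minimal Build.scala sketch (sbt) that pulls the CDH-built artifacts instead of the stock Apache ones could look like the following; the project name and the exact CDH artifact versions are assumptions here and should be matched to whatever your cluster actually runs:

import sbt._
import Keys._

object ScaldingHBaseBuild extends Build {
  // Assumed project name; the point is the resolver and the CDH-versioned dependencies.
  lazy val root = Project("scalding-hbase-job", file(".")).settings(
    // Cloudera publishes Hadoop/HBase artifacts built for each CDH release
    resolvers += "Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
    libraryDependencies ++= Seq(
      // Client jars from the same CDH line as the cluster (CDH 4.1.3 in the question)
      "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.1.3",
      "org.apache.hbase"  % "hbase"         % "0.92.1-cdh4.1.3"
    )
  )
}

The key point is that the Hadoop and HBase client jars both come from the same CDH line as the cluster, so the HBase client and the deployed Hadoop agree on classes like NetUtils.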
The change that broke compatibility was committed as part of HADOOP-8350. Take a look at Ted Yu's comments and the responses. He works on HBase and had the same issue. Later versions of the HBase libraries should automatically handle this issue, according to his comment.


Apache Sqoop moved into the Attic in 2021-06

I have installed Hadoop 3.3.1 and Sqoop 1.4.7, which don't seem to be compatible: I am getting a deprecated API error while importing an RDBMS table.
When I searched for compatible versions, I found that Apache Sqoop has been moved to the Apache Attic, and the documentation for 1.4.7, the last stable version, says that "Sqoop is currently supporting 4 major Hadoop releases - 0.20, 0.23, 1.0 and 2.0."
Would you please explain what this means and what I should do?
Could you also suggest alternatives to Sqoop?
It means just what the board minutes say: Sqoop has become inactive and is now moved to the Apache Attic. This doesn't mean Sqoop is deprecated in favor of some other project, but for practical purposes you should probably not build new implementations using it.
Much of the same functionality is available in other tools, including other Apache projects. Possible options are Spark, Kafka, Flume. Which one to use is very dependent on the specifics of your use case, since none of these quite fill the same niche as Sqoop. The database connectivity capabilities of Spark make it the most flexible solution, but it also could be the most labor-intensive to set up. Kafka might work, although it's not quite as ad-hoc friendly as Sqoop (take a look at Kafka Connect). I probably wouldn't use Flume, but it might be worth a look (it is mainly meant for shipping logs).
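To make the Spark option concrete, here is a minimal sketch of a Sqoop-style table import done with Spark's JDBC data source (Scala); the JDBC URL, credentials, table name, partition bounds and output path are all placeholders for your own environment, and the matching JDBC driver jar still has to be on the Spark classpath, just as it had to be in Sqoop's lib directory:

import org.apache.spark.sql.SparkSession

object JdbcImportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sqoop-style-import").getOrCreate()

    // Read a whole table over JDBC, roughly what `sqoop import --table orders` did.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/shop")
      .option("dbtable", "orders")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .option("numPartitions", "4")           // parallel reads, similar to Sqoop mappers
      .option("partitionColumn", "order_id")  // must be a numeric, date or timestamp column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .load()

    // Land the data on HDFS/S3 in a columnar format instead of delimited text files.
    orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")

    spark.stop()
  }
}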

HBase client - server’s version compatibility

I wonder how I can know whether my HBase client jar fits my HBase server's version. Is there any place where it is specified which HBase server versions are supported by a given HBase client jar?
In my case I want to use the newest HBase client jar (2.4.5) with a pretty old HBase server (version 1.2). Is there any place where I can check the compatibility to know if it’s possible and supported?
I'd like to know if there is a table that shows version compatibility, like other databases have. Something like:
https://docs.mongodb.com/drivers/java/sync/current/compatibility/
Perhaps you can use the checkcompatibility.py script provided with HBase itself to generate a client API compatibility report between 1.2 and 2.4. I haven't used 2.4 myself, but based on prior history I would not expect two different major versions to be free of breaking changes.

Which version of Spark to download?

I understand you can download Spark source code (1.5.1), or prebuilt binaries for various versions of Hadoop. As of Oct 2015, the Spark webpage http://spark.apache.org/downloads.html has prebuilt binaries against Hadoop 2.6+, 2.4+, 2.3, and 1.X.
I'm not sure what version to download.
I want to run a Spark cluster in standalone mode using AWS machines.
<EDIT>
I will be running a 24/7 streaming process. My data will be coming from a Kafka stream. I thought about using spark-ec2, but since I already have persistent ec2 machines, I thought I might as well use them.
My understanding is that since my persistent workers need to perform checkpoint(), they need access to some kind of file system shared with the master node. S3 seems like a logical choice.
</EDIT>
This means I need to access S3, but not hdfs. I do not have Hadoop installed.
I got a pre-built Spark for Hadoop 2.6. I can run it in local mode, such as the wordcount example. However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need hadoop?
<EDIT>
It's not a show stopper, but I want to make sure I understand the reason for this warning message. I was under the assumption that Spark doesn't need Hadoop, so why is it even showing up?
</EDIT>
I'm not sure what version to download.
This consideration will also be guided by what existing code you are using, features you require, and bug tolerance.
I want to run a Spark cluster in standalone mode using AWS machines.
Have you considered simply running Apache Spark on Amazon EMR? See also How can I run Spark on a cluster? from Spark's FAQ, and their reference to their EC2 scripts.
This means I need to access S3, but not hdfs
One does not imply the other. You can run a Spark cluster on EC2 instances perfectly fine and never touch S3. While many examples are written using S3 access through the out-of-the-box S3 "fs" drivers for the Hadoop library, note that there are now three different access methods (s3, s3n, s3a). Configure as appropriate.
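As a rough sketch of what "configure as appropriate" can look like for the classic s3n connector (Scala; the bucket, path and environment variable names are placeholders, and newer deployments would use s3a with its own property names):

import org.apache.spark.{SparkConf, SparkContext}

object S3AccessSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-read"))

    // S3 credentials live in the underlying Hadoop configuration, not in SparkConf.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Any Hadoop-supported filesystem URI works here; hdfs:// paths work the same way.
    val lines = sc.textFile("s3n://my-bucket/input/*.log")
    println("line count: " + lines.count())

    sc.stop()
  }
}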
However, your choice of libraries to load will depend on where your data is. Spark can access any filesystem supported by Hadoop, from which there are several to choose.
Is your data even in files? Depending on your application, and where your data is, you may only need to use Data Frame over SQL, Cassandra, or others!
However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need hadoop?
Not a problem. It is telling you that it is falling back to a non-optimum implementation. Others have asked this question, too.
In general, it sounds like you don't have any application needs right now, so you don't have any dependencies. Dependencies are what would drive different configurations such as access to S3, HDFS, etc.
I can run it in local mode, such as the wordcount example.
So, you're good?
UPDATE
I've edited the original post
My data will be coming from a Kafka stream. ... My understanding is that .. my persistent workers need to perform checkpoint().
Yes, the Direct Kafka approach is available from Spark 1.3 on, and per that article, uses checkpoints. These require a "fault-tolerant, reliable file system (e.g., HDFS, S3, etc.)". See the Spark Streaming + Kafka Integration Guide for your version for specific caveats.
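A minimal sketch of that setup with the Spark 1.x-era direct Kafka API (Scala), assuming the spark-streaming-kafka artifact for your Spark version is on the classpath; the broker list, topic and S3 checkpoint bucket are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  // Checkpoints must live on a shared, fault-tolerant store (HDFS, S3, ...).
  val checkpointDir = "s3n://my-bucket/spark/checkpoints"

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-streaming"), Seconds(10))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // Placeholder processing: count messages per batch.
    stream.map(_._2).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint after a driver restart, or build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

The parts that matter for your question are ssc.checkpoint pointing at the shared store and StreamingContext.getOrCreate recovering from it when the 24/7 driver is restarted.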
So why [do I see the Hadoop warning message]?
The Spark download only comes with so many Hadoop client libraries. A fully configured Hadoop installation also includes platform-specific native binaries for certain packages, which get used if available. To use them, augment Spark's classpath; otherwise, the loader will fall back to the less performant built-in Java versions.
Depending on your configuration, you may be able to take advantage of a fully configured Hadoop or HDFS installation. You mention wanting to use your existing, persistent EC2 instances rather than spinning up something new. There is a tradeoff between S3 and HDFS: S3 is a separate resource (more cost) but survives when your instances are offline, so you can take the compute down and keep the storage; on the other hand, S3 can suffer from higher latency than HDFS (you already have the machines, so why not run a filesystem over them?) and does not behave like a filesystem in all cases. Microsoft describes a similar tradeoff for choosing between Azure storage and HDFS when using HDInsight, for example.
We're also running Spark on EC2 against S3 (via the s3n file system). We had some issues with the pre-built versions for Hadoop 2.x; regrettably, I don't remember what the problem was. In the end we went with the pre-built Spark for Hadoop 1.x and it works great.

How To Refresh/Clear the DistributedCache When Using Hue + Beeswax To Run Hive Queries That Define Custom UDFs?

I've set up a Hadoop cluster (using the Cloudera distro through Cloudera Manager) and I'm running some Hive queries using the Hue interface, which uses Beeswax underneath.
All my queries run fine and I have even successfully deployed a custom UDF.
But, while deploying the UDF, I ran into a very frustrating versioning issue. In the initial version of my UDF class, I used a 3rd party class that was causing a StackOverflowError.
I fixed this error and then verified that the UDF can be deployed and used successfully from the hive command line.
Then, when I went back to using Hue and Beeswax again, I kept getting the same error. I could fix this only by changing my UDF java class name. (From Lower to Lower2).
Now, my question is, what is the proper way to deal with these kind of version issues?
From what I understand, when I add jars using the handy form fields to the left, they get added to the distributed cache. So, how do I refresh/clear the distributed cache? (I couldn't get LIST JARS; etc. to run from within Hive / Beeswax. It gives me a syntax error.)
Since the classes are loaded into the Beeswax Server JVM (the same goes for the HiveServer1 and HiveServer2 JVMs), deploying a new version of a jar often requires restarting these services to avoid such class loading issues.
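For context, the UDF in question is just a class that the Beeswax/HiveServer JVM loads by its fully qualified name, which is why renaming it (Lower to Lower2) forced a fresh class to be picked up. A minimal sketch of such a UDF, written here in Scala against the classic org.apache.hadoop.hive.ql.exec.UDF API purely for illustration (the original was presumably Java):

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Hive resolves the UDF by class name, so a redeployed jar with the same class
// name can be shadowed by the copy already loaded in the long-running JVM.
class Lower extends UDF {
  def evaluate(input: Text): Text =
    if (input == null) null else new Text(input.toString.toLowerCase)
}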

Hadoop + Hbase compatibility issues

I have searched a lot about the following issue that I am facing:
java.io.IOException: Call to /10.0.1.37:50070 failed on local
exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1139)
at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) ....
I found links like: What is the meaning of EOF exceptions in hadoop namenode connections from hbase/filesystem? and others, but none of them worked for me.
Now I am starting to feel that I don't really understand these version compatibility issues.
What confuses me most is the HBase documentation about Hadoop compatibility, which says something like "This version of Hbase will run only on Hadoop 0.20". What does 'this' refer to here? Do they mean 0.93-snapshot (which is at the top of the documentation)?
Finally, I am using Hadoop version 0.20.203 and Hbase 0.90.4. Can some one tell me if these versions are compatible?
Thanks in advance!!
I agree that the book gives a weird reference when it talks about "this version" and also mentions "0.93". To make things a bit clearer: the book currently transcends versions but lives only in trunk, which is currently called 0.93 (and compiling it adds -snapshot).
In any case, all HBase versions are currently compatible with all Hadoop 0.20.* releases, be it 0.20.2 or 0.20.205.0, and the latter is the only one right now that supports appends. The version you are using, 0.20.203, does not, and you can lose data if a region server dies.
Your EOF exception is probably because you didn't properly swap the Hadoop jars in your HBase lib/ folder. I answered a similar question on the mailing list yesterday EOFException in HBase 0.94 (it was mistitled 0.94, it should have been 0.90.4) which gives other clues on debugging this.
Finally, your stack trace has a weird port number in it. 50070 is the web UI, not the Namenode RPC port which by default is 9000. It could be that you are giving HBase the wrong port number.
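On the port point: the value that usually carries this mistake is hbase.rootdir in hbase-site.xml, which should reference the Namenode RPC port (for example hdfs://10.0.1.37:9000/hbase), not 50070. Here is a small Scala sketch for checking from the client side that HBase itself is reachable; the quorum address is a placeholder based on the IP in the question:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HBaseAdmin

object HBaseReachabilityCheck {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    // The client only needs the ZooKeeper quorum; an hbase-site.xml on the
    // classpath can supply this instead of setting it in code.
    conf.set("hbase.zookeeper.quorum", "10.0.1.37")

    // Throws MasterNotRunningException / ZooKeeperConnectionException if the
    // cluster cannot be reached with this configuration.
    HBaseAdmin.checkHBaseAvailable(conf)
    println("HBase is reachable")
  }
}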
I took inputs from the links posted and it worked for me. The only extra thing I had to do was copy an additional guava*.jar found in $HADOOP_HOME/lib into $HBASE_HOME/lib (using hadoop-0.20.2).
