Hadoop + HBase compatibility issues

I have searched a lot about the following issue that I am facing:
java.io.IOException: Call to /10.0.1.37:50070 failed on local
exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1139)
at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) ....
I found links like "What is the meaning of EOF exceptions in hadoop namenode connections from hbase/filesystem?" and others, but none of them worked for me.
Now I am starting to feel that I do not properly understand the version compatibility issues.
What confuses me the most is the HBase documentation on Hadoop compatibility, which says something like "This version of HBase will run only on Hadoop 0.20". What does 'this' refer to here? Do they mean 0.93-snapshot (which is at the top of the documentation)?
Finally, I am using Hadoop version 0.20.203 and HBase 0.90.4. Can someone tell me if these versions are compatible?
Thanks in advance!!

I agree that the book gives a weird reference when it talks about "this version" and also about "0.93". To make things a bit clearer: the book currently transcends versions but lives only in trunk, which is currently called 0.93 (and compiling it adds -snapshot).
In any case, all HBase versions are currently compatible with all Hadoop 0.20.* releases, be it 0.20.2 or 0.20.205.0, and the latter is the only one right now that supports appends. The version you are using, 0.20.203, does not, and you can lose data if a region server dies.
Your EOF exception is probably because you didn't properly swap the Hadoop jars in your HBase lib/ folder. I answered a similar question on the mailing list yesterday, "EOFException in HBase 0.94" (it was mistitled 0.94; it should have been 0.90.4), which gives other clues on debugging this.
Finally, your stack trace has a weird port number in it. 50070 is the web UI, not the NameNode RPC port, which by default is 9000. It could be that you are giving HBase the wrong port number.
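For what it's worth, here is a small Java sketch (an illustration only; the address is a placeholder, not taken from your setup) that talks to the NameNode over its RPC port with the plain FileSystem API. If this also throws an EOFException, the problem is the client-side jars or the address rather than HBase itself, and whatever URI works here is what hbase.rootdir in hbase-site.xml should be based on.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 9000 is the usual default RPC port; 50070 is only the HTTP web UI.
        conf.set("fs.default.name", "hdfs://10.0.1.37:9000"); // placeholder address

        FileSystem fs = FileSystem.get(conf);
        // A trivial metadata call; it fails fast if client and server RPC versions differ.
        System.out.println("Root exists: " + fs.exists(new Path("/")));
        fs.close();
    }
}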

I took inputs from the links posted and it worked for me. I only had to copy an additional guava*.jar found in $HADOOP_HOME/lib into $HBASE_HOME/lib (using hadoop-0.20.2).

Related

Apache Sqoop moved into the Attic in 2021-06

I have installed Hadoop version 3.3.1 and Sqoop 1.4.7, which do not seem compatible; I am getting a deprecated-API error while importing an RDBMS table.
When I tried to google for compatible versions, I found that Apache Sqoop has been moved to the Apache Attic, and that version 1.4.7, the last stable version, states in its documentation that "Sqoop is currently supporting 4 major Hadoop releases - 0.20, 0.23, 1.0 and 2.0."
Would you please explain what this means and what I should do?
Could you please also suggest alternatives to Sqoop?
It means just what the board minutes say: Sqoop has become inactive and has now been moved to the Apache Attic. This doesn't mean Sqoop is deprecated in favor of some other project, but for practical purposes you should probably not build new implementations using it.
Much of the same functionality is available in other tools, including other Apache projects. Possible options are Spark, Kafka, Flume. Which one to use is very dependent on the specifics of your use case, since none of these quite fill the same niche as Sqoop. The database connectivity capabilities of Spark make it the most flexible solution, but it also could be the most labor-intensive to set up. Kafka might work, although it's not quite as ad-hoc friendly as Sqoop (take a look at Kafka Connect). I probably wouldn't use Flume, but it might be worth a look (it is mainly meant for shipping logs).
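To make the Spark option a bit more concrete, here is a rough sketch (not a drop-in solution; the connection string, credentials, table and output path are all placeholders, and it assumes Spark 2.x+ with a suitable JDBC driver available) of a Sqoop-style import of an RDBMS table:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcImport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-import")
                .getOrCreate();

        // Read the source table over JDBC, roughly what `sqoop import` does.
        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://db-host:3306/mydb") // placeholder connection string
                .option("dbtable", "my_table")                   // placeholder table
                .option("user", "user")
                .option("password", "password")
                .load();

        // Land it on HDFS as Parquet; any supported format/path would do.
        df.write().parquet("hdfs:///user/example/my_table");

        spark.stop();
    }
}

You would submit this with spark-submit and pass the JDBC driver jar along (for example via --jars).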

Druid + Hadoop (for both uses, deep-store & indexing)

If I have a Hadoop server (pseudo-distributed mode) running on a separate machine, do I still need to have these files under my Druid conf dir: http://druid.io/docs/latest/configuration/hadoop.html ?
The way I see it:
Looks like those -site.xml files are for the Hadoop server..., and Druid only acts as a Hadoop client. So I don't think Druid needs the hdfs-site.xml.
Core-site.xml..., ok, I can get it. I mean, Druid needs to know the IP of the NameNode (Hadoop).
Mapred-site.xml, partially. Druid needs to know the status of MapReduce jobs (I suppose it will delegate the indexing to Hadoop as an MR job). So it needs to communicate with those job trackers to see if the indexing is finished / failed / in progress. For that, it needs the URL of the Hadoop JobTracker.
However, Druid does not need the property "mapreduce.cluster.local.dir", because it does not participate actively in the MR job.
Yarn-site.xml? Maybe it should stay, partially. At least for submitting a job (?).
What about hdfs-site.xml? I think this can be scrapped completely.
Capacity-scheduler.xml? It can go.
Please correct me If I'm wrong.
These questions / doubts arise because I'm quite new to Hadoop. I have my Hadoop setup running in pseudo-distributed mode. I also tested it with the javascript webhdfs library to write and read files, and I have tried the sample MR jobs provided by the Hadoop distribution. So I guess my Hadoop setup is fine. I'm just a bit unsure on the Druid side, partly because the doc is not very clear about it.
Btw, I have Hadoop 2.7.2, while the hadoop-client libs used by Druid are still on 2.3.0.
Should I downgrade my hadoop server to 2.3.0?
http://druid.io/docs/latest/operations/other-hadoop.html
Thanks,
Raka
Please add mapred-site.xml, core-site.xml, hdfs-site.xml and yarn-site.xml to the classpath.
Also, you don't need to downgrade; Druid works well with 2.7.X.
As you can see in the doc, you can use multiple versions of Hadoop.

Cascading HBase Tap

I am trying to write Scalding jobs which have to connect to HBase, but I have trouble using the HBase tap. I have tried using the tap provided by Twitter Maple, following this example project, but it seems that there is some incompatibility between the Hadoop/HBase version that I am using and the one that was used as client by Twitter.
My cluster is running Cloudera CDH4 with HBase 0.92 and Hadoop 2.0.0-cdh4.1.3. Whenever I launch a Scalding job connecting to HBase, I get the exception
java.lang.NoSuchMethodError: org.apache.hadoop.net.NetUtils.getInputStream(Ljava/net/Socket;)Ljava/io/InputStream;
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:363)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1046)
...
It seems that the HBase client used by Twitter Maple is expecting some method on NetUtils that does not exist on the version of Hadoop deployed on my cluster.
How do I track down what exactly is the mismatch - what version would the HBase client expect and so on? Is there in general a way to mitigate these issues?
It seems to me that client libraries are often compiled against a hardcoded version of the Hadoop dependencies, and it is hard to make those match the actual versions deployed.
The method actually exists but has changed its signature. Basically, it boils down to having different versions of Hadoop libraries on your client and server. If your server is running Cloudera, you should be using the HBase and Hadoop libraries from Cloudera. If you're using Maven, you can use Cloudera's Maven repository.
It seems like library dependencies are handled in Build.scala. I haven't used Scala yet, so I'm not entirely sure how to fix it there.
The change that broke compatibility was committed as part of HADOOP-8350. Take a look at Ted Yu's comments and the responses. He works on HBase and had the same issue. Later versions of the HBase libraries should automatically handle this issue, according to his comment.
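As a first step in tracking down such a mismatch, a tiny diagnostic along these lines (purely illustrative) prints which Hadoop version and which jar the client JVM actually loads; run it with the same classpath as the Scalding job and compare against the version on the cluster:

import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.util.VersionInfo;

public class HadoopVersionCheck {
    public static void main(String[] args) {
        // Version of the Hadoop client libraries actually on the classpath.
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        // Which jar NetUtils (the class named in the NoSuchMethodError) came from.
        System.out.println("NetUtils loaded from: "
                + NetUtils.class.getProtectionDomain().getCodeSource().getLocation());
    }
}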

Writing data to Hadoop

I need to write data in to Hadoop (HDFS) from external sources like a windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and there is an easy way to code external clients against HDFS.
There is an API in Java. You can use it by including the Hadoop code in your project.
The JavaDoc is quite helpful in general, but of course you have to know what you are looking for *g*
http://hadoop.apache.org/common/docs/
For your particular problem, have a look at:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
(this applies to the latest release, consult other JavaDocs for different versions!)
A typical call would be:
FileSystem.get(new JobConf()).create(new Path("however.file"));
This returns a stream that you can handle with regular Java I/O.
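Fleshing that one-liner out, a minimal sketch of an external client writing a file to HDFS could look like this (the NameNode URI and path are placeholders; on newer releases the property is called fs.defaultFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode-host:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/example/however.file"));
        out.write("hello from an external client".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}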
For the problem of loading the data I needed to put into HDFS, I chose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job where the mapper read the file from the file server (in this case via HTTPS), then wrote it directly to HDFS (via the Java API).
The list of files is read from the input. I then have an external script that populates a file with the list of files to fetch, uploads that file into HDFS (using hadoop dfs -put), and then starts the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
Maybe not the answer you were looking for, but hopefully helpful anyway :-).
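For what it's worth, a rough sketch of such a mapper might look like the following (class name, target directory and error handling are purely illustrative, not the code from the original job):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input line is a URL; the mapper downloads it and writes it straight to HDFS.
public class FetchToHdfsMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String url = value.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);

        String name = url.substring(url.lastIndexOf('/') + 1);
        Path target = new Path("/ingest/" + name); // illustrative target directory

        InputStream in = new URL(url).openStream();
        FSDataOutputStream out = fs.create(target);
        IOUtils.copyBytes(in, out, conf, true); // copies and closes both streams

        context.write(new Text(url), NullWritable.get());
    }
}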
About 2 years after my last answer, there are now two new alternatives - Hoop/HttpFS, and WebHDFS.
Regarding Hoop, it was first announced on Cloudera's blog and can be downloaded from a github repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1; it can probably talk to slightly older versions as well.
If you're running Hadoop 0.23.1, which at the time of writing is still not released, Hoop is instead part of Hadoop as its own component, HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can be a proxy not only to HDFS, but also to other Hadoop-compatible filesystems such as Amazon S3.
Hoop/HttpFS runs as its own standalone service.
There's also WebHDFS which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0 and one of its major features is that it provides data locality - when you're making a read request, you will be redirected to the WebHDFS component on the datanode where the data resides.
Which component to choose depends a bit on your current setup and what needs you have. If you need an HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the github repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that, unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limitation, etc.)
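To illustrate the REST side (host, port, path and user below are placeholders), reading a file over WebHDFS is a single redirected HTTP GET; the NameNode answers with a redirect to a DataNode that actually holds the data:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
    public static void main(String[] args) throws Exception {
        // 50070 is the default NameNode HTTP port that serves WebHDFS.
        URL url = new URL("http://namenode-host:50070/webhdfs/v1/user/example/file.txt"
                + "?op=OPEN&user.name=hdfs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true); // follow the NameNode's redirect to a DataNode

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        conn.disconnect();
    }
}

Writes work the same way, except that op=CREATE first returns a redirect whose Location header names the DataNode to which the file body should then be PUT.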
Install Cygwin, install Hadoop locally (you just need the binary and configs that point at your NN -- no need to actually run the services), then run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/
You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.
There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.
It seems there is a dedicated page now for this at http://wiki.apache.org/hadoop/MountableHDFS:
These projects (enumerated below) allow HDFS to be mounted (on most
flavors of Unix) as a standard file system using the mount command.
Once mounted, the user can operate on an instance of hdfs using
standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find',
'grep', or use standard Posix libraries like open, write, read, close
from C, C++, Python, Ruby, Perl, Java, bash, etc.
Later it describes these projects
contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
hdfs-fuse - a google code project is very similar to contrib/fuse-dfs
webdav - hdfs exposed as a webdav resource
mapR - contains a closed source hdfs compatible file system that supports read/write NFS access
HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs sequentially.
I haven't tried any of these, but I will update the answer soon, as I have the same need as the OP.
You can now also try to use Talend, which includes components for Hadoop integration.
You can try mounting HDFS on the machine (call it machine_X) where you are executing your code; machine_X should have InfiniBand connectivity with the HDFS cluster. Check this out: https://wiki.apache.org/hadoop/MountableHDFS
You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.

Any tested Frameworks/Solutions similar to Apache Hadoop?

I am interested in the Apache Hadoop project, but I would like to know if any other tested (please mind the 'tested') projects/frameworks are out there.
Appreciate any information/links to projects similar to Apache Hadoop and any comments on the Apache Hadoop project from anyone that has used it.
Regards,
As mentioned in an answer to this question:
https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
MongoDB might be something you could look at. It's a scalable database which allows MapReduce algorithms to be run against it.
There are indeed open-source projects utilizing and building on Hadoop.
See Apache Mahout for data mining: http://lucene.apache.org/mahout/
And are you aware of the other MR implementations available?
http://en.wikipedia.org/wiki/MapReduce#Implementations
Maybe. But none of them will have anywhere near the testing and real-world experience that Hadoop does. Companies like Facebook and Yahoo are paying to scale Hadoop, and I know of no similar open source projects that are really worth looking at.
A possible way is to use org.apache.hadoop.hdfs.MiniDFSCluster and org.apache.hadoop.mapred.MiniMRCluster, which are used in testing Hadoop itself.
What they do is launch a small cluster locally. To test your program, point your hdfs-site.xml settings at the local cluster and add them to your classpath. This local cluster behaves just like a real cluster, only smaller. You can use hadoop/src/test/*-site.xml as templates.
For more examples, take a look at hadoop/src/test/.
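A sketch of what that looks like (the constructor shown matches the old 0.20-era API; newer releases use MiniDFSCluster.Builder, and the class ships in the Hadoop test jar):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class MiniClusterExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 1 datanode, format the test storage, default rack topology.
        MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
        try {
            FileSystem fs = cluster.getFileSystem();
            fs.create(new Path("/tmp/smoke-test")).close();
            System.out.println("Mini HDFS is up at " + fs.getUri());
        } finally {
            cluster.shutdown();
        }
    }
}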
There is a Hadoop-like framework, built on top of Hadoop, that emphasizes prioritized execution of iterative algorithms.
It is tested. I have run the WordCount example on it. It is very similar to Hadoop (especially the installation).
You can find the paper here :
http://rio.ecs.umass.edu/mnilpub/papers/socc11-zhang.pdf
and the code here
https://code.google.com/p/priter/
Hope this helps
A
