Hadoop File Copy Native Java API vs WebHDFS - hadoop

I have a requirement to copy files from HDFS to local. Now I have two options:
1) Use the Hadoop native Java API (FileSystem), or
2) Use WebHDFS (I don't have any issues with enabling it on my cluster).
Can someone let me know which is the preferred option and why?

If you are using Java, I recommend the native Java API, as it's more flexible and gives you more control.
However, WebHDFS is better if you don't want to pull in the dozens of libraries required by Hadoop; it decouples your application from Hadoop. Of course, you pay a small performance cost for going over HTTP.
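To make the comparison concrete, here is a minimal sketch of the native-API route, assuming your cluster's core-site.xml/hdfs-site.xml are on the classpath; the paths below are placeholders:
// Minimal sketch: copy one file from HDFS to the local filesystem via the native API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsToLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);                 // HDFS client
        fs.copyToLocalFile(new Path("/data/input.txt"),       // source on HDFS (placeholder)
                           new Path("/tmp/input.txt"));       // destination on local disk (placeholder)
        fs.close();
    }
}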

Related

Hadoop (HDFS) - file versioning

At the moment I have a user file system in my application (Apache CMIS).
As it's growing bigger, I'm considering moving to Hadoop (HDFS), since we need to run some statistics on it as well.
The problem:
The current file system provides versioning of the files.
When I read about Hadoop (HDFS) and file versioning, I mostly found that I would have to write this versioning layer myself.
Is there already something available to manage versioning of files in HDFS, or do I really have to write it myself? (I don't want to reinvent the wheel, but I can't find a proper solution either.)
Answer
For full details: see comments on answer(s) below
Hadoop (HDFS) doesn't support versioning of files. You can get this functionality by combining Hadoop with (Amazon) S3:
Hadoop will use S3 as the filesystem (without chunks, but recovery will be provided by S3). This solution comes with the versioning of files that S3 provides.
Hadoop will still use YARN for the distributed processing.
Versioning is not possible with HDFS.
Instead you can use Amazon S3, which provides Versioning and is also compatible with Hadoop.
HDFS supports snapshots. I think that's as close as you can get to "versioning" with HDFS.
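For completeness, a rough sketch of taking a snapshot through the same FileSystem API (Hadoop 2.x and later); the directory must first have been made snapshottable by an admin (hdfs dfsadmin -allowSnapshot), and the path and snapshot name below are placeholders:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Assumes an admin has already run: hdfs dfsadmin -allowSnapshot /user/data
        Path snapshot = fs.createSnapshot(new Path("/user/data"), "before-cleanup"); // placeholders
        System.out.println("Snapshot created at " + snapshot);
        fs.close();
    }
}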

Accessing Hadoop data using REST service

I am trying to update an HDP architecture so that data residing in Hive tables can be accessed via REST APIs. What are the best approaches to expose data from HDP to other services?
This is my initial idea:
I am storing data in Hive tables and I want to expose some of the information through a REST API, so I thought that using HCatalog/WebHCat would be the best solution. However, I found out that it only allows querying metadata.
What options do I have here?
Thank you
You can very well use WebHDFS, which is basically a REST service over Hadoop.
Please see documentation below:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
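To give an idea of what a WebHDFS call looks like from any HTTP client, here is a hedged sketch that reads a file through the REST endpoint. The host, port and file path are placeholders (older releases typically expose WebHDFS on the NameNode's HTTP port, e.g. 50070), and a secured cluster would additionally need authentication such as a user.name parameter or SPNEGO:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
    public static void main(String[] args) throws Exception {
        // op=OPEN returns a redirect to the datanode holding the data;
        // HttpURLConnection follows the redirect automatically.
        URL url = new URL("http://namenode.example.com:50070/webhdfs/v1/user/hive/data.csv?op=OPEN"); // placeholders
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}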
The REST API gateway for the Apache Hadoop ecosystem is called Knox.
I would check it out before exploring any other options. In other words, do you have any reason to avoid using Knox?
What version of HDP are you running?
The Knox component has been available for quite a while and is manageable via Ambari.
Can you get an instance of HiveServer2 running in HTTP mode?
This would give you SQL access through J/ODBC drivers without requiring Hadoop config and binaries (other than those required for the drivers) on the client machines.
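As an illustration of that HiveServer2 route, a hedged JDBC sketch; the host, port, httpPath, credentials and table name are placeholders, so check your hive-site.xml for the actual HTTP-mode settings:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveHttpQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // transportMode=http tells the driver to use HiveServer2's HTTP endpoint (port and httpPath are placeholders).
        String url = "jdbc:hive2://hs2.example.com:10001/default;transportMode=http;httpPath=cliservice";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");   // placeholder credentials
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM my_table LIMIT 10")) { // placeholder table
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}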

Which version of Spark to download?

I understand you can download Spark source code (1.5.1), or prebuilt binaries for various versions of Hadoop. As of Oct 2015, the Spark webpage http://spark.apache.org/downloads.html has prebuilt binaries against Hadoop 2.6+, 2.4+, 2.3, and 1.X.
I'm not sure what version to download.
I want to run a Spark cluster in standalone mode using AWS machines.
<EDIT>
I will be running a 24/7 streaming process. My data will be coming from a Kafka stream. I thought about using spark-ec2, but since I already have persistent ec2 machines, I thought I might as well use them.
My understanding is that since my persistent workers need to perform checkpoint(), they need access to some kind of shared file system with the master node. S3 seems like a logical choice.
</EDIT>
This means I need to access S3, but not hdfs. I do not have Hadoop installed.
I got a pre-built Spark for Hadoop 2.6. I can run it in local mode, such as the wordcount example. However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need hadoop?
<EDIT>
It's not a show stopper, but I want to make sure I understand the reason for this warning message. I was under the assumption that Spark doesn't need Hadoop, so why is it even showing up?
</EDIT>
I'm not sure what version to download.
This consideration will also be guided by what existing code you are using, features you require, and bug tolerance.
I want to run a Spark cluster in standalone mode using AWS instances.
Have you considered simply running Apache Spark on Amazon EMR? See also How can I run Spark on a cluster? from Spark's FAQ, and their reference to their EC2 scripts.
This means I need to access S3, but not hdfs
One does not imply the other. You can run a Spark cluster on EC2 instances perfectly fine and never have to access S3. While many examples are written using S3 access through the out-of-the-box S3 "fs" drivers for the Hadoop library, note that there are now three different S3 access methods; configure as appropriate.
However, your choice of libraries to load will depend on where your data is. Spark can access any filesystem supported by Hadoop, from which there are several to choose.
Is your data even in files? Depending on your application, and where your data is, you may only need to use DataFrames over SQL, Cassandra, or other sources!
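As a concrete illustration of S3 access from a standalone cluster, a minimal sketch using the s3n connector; the bucket, path and credentials are placeholders, and the property names differ if you use s3a:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3ReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("s3-read-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Credentials for the s3n connector (placeholders); s3a uses fs.s3a.access.key / fs.s3a.secret.key instead.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");
        JavaRDD<String> lines = sc.textFile("s3n://my-bucket/path/to/data/*"); // placeholder bucket/path
        System.out.println("Line count: " + lines.count());
        sc.stop();
    }
}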
However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need hadoop?
Not a problem. It is telling you that it is falling back to a non-optimum implementation. Others have asked this question, too.
In general, it sounds like you don't have any application needs right now, so you don't have any dependencies. Dependencies are what would drive different configurations such as access to S3, HDFS, etc.
I can run it in local mode, such as the wordcount example.
So, you're good?
UPDATE
I've edited the original post
My data will be coming from a Kafka stream. ... My understanding is that .. my persistent workers need to perform checkpoint().
Yes, the Direct Kafka approach is available from Spark 1.3 on, and per that article, uses checkpoints. These require a "fault-tolerant, reliable file system (e.g., HDFS, S3, etc.)". See the Spark Streaming + Kafka Integration Guide for your version for specific caveats.
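A minimal sketch of pointing the streaming checkpoint at S3; the bucket path and batch interval are placeholders, and the Kafka direct-stream setup itself is omitted:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("streaming-checkpoint-sketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10)); // placeholder interval
        // Checkpoints must live on a fault-tolerant store shared by driver and workers,
        // e.g. an S3 bucket when no HDFS is available.
        jssc.checkpoint("s3n://my-bucket/spark-checkpoints"); // placeholder bucket
        // ... define the Kafka direct stream and transformations here ...
        jssc.start();
        jssc.awaitTermination();
    }
}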
So why [do I see the Hadoop warning message]?
The Spark download only comes with so many Hadoop client libraries. A fully configured Hadoop installation also has platform-specific native binaries for certain packages, which get used if available. To use them, augment Spark's classpath; otherwise the loader falls back to less performant versions.
Depending on your configuration, you may be able to take advantage of a fully configured Hadoop or HDFS installation. You mention taking advantage of your existing, persistent EC2 instances, rather than using something new. There's a tradeoff between S3 and HDFS: S3 is a new resource (more cost) but survives when your instance is offline (can take compute down and have persisted storage); however, S3 might suffer from latency compared to HDFS (you already have the machines, why not run a filesystem over them?), as well as not behave like a filesystem in all cases. This tradeoff is described by Microsoft for choosing Azure storage vs. HDFS, for example, when using HDInsight.
We're also running Spark on EC2 against S3 (via the s3n file system). We had some issues with the pre-built versions for Hadoop 2.x; regrettably, I don't remember what the issue was. In the end we're running the pre-built Spark for Hadoop 1.x and it works great.

Hadoop services are not getting started if I use a particular zlib library

I'm trying to use a different zlib library with Hadoop. When I use one particular library, the Hadoop services do not start; they start properly with a different library. The only difference between the libraries is that the non-working one uses a device file to send data to a kernel driver for compression. Where could the issue be? Also, where should I look for the error logs in Hadoop?

Writing data to Hadoop

I need to write data in to Hadoop (HDFS) from external sources like a windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and there is an easy way to code external clients against HDFS.
There is an API in Java. You can use it by including the Hadoop code in your project.
The JavaDoc is quite helpful in general, but of course you have to know what you are looking for.
http://hadoop.apache.org/common/docs/
For your particular problem, have a look at:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
(this applies to the latest release, consult other JavaDocs for different versions!)
A typical call would be:
FileSystem.get(new JobConf()).create(new Path("however.file"));
This returns a stream you can handle with regular Java IO.
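Fleshing that out slightly, a minimal sketch that writes a local file into HDFS from a plain Java client; the paths are placeholders, and the Configuration must be able to find your namenode (e.g. via core-site.xml on the classpath):
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = new FileInputStream("C:/data/export.csv");          // local source (placeholder)
             FSDataOutputStream out = fs.create(new Path("/ingest/export.csv"))) { // HDFS destination (placeholder)
            IOUtils.copyBytes(in, out, 4096, false);      // stream the bytes across
        }
        fs.close();
    }
}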
For the problem of loading the data I needed to put into HDFS, I chose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java map/reduce job where the mapper reads the file from the file server (in this case via https) and then writes it directly to HDFS (via the Java API).
The list of files is read from the input. I have an external script that populates a file with the list of files to fetch, uploads that file into HDFS (using hadoop dfs -put), and then starts the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
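Roughly, the mapper in that approach looks like the following sketch; the class name and target directory are illustrative and error handling is omitted. Each input line is assumed to be one https URL to fetch:
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FetchToHdfsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String fileUrl = value.toString().trim();                        // one https URL per input line
        String name = fileUrl.substring(fileUrl.lastIndexOf('/') + 1);   // derive the target file name
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path target = new Path("/ingest/" + name);                       // illustrative destination directory
        try (InputStream in = new URL(fileUrl).openStream();
             FSDataOutputStream out = fs.create(target)) {
            IOUtils.copyBytes(in, out, 4096, false);                     // pull from the file server, push to HDFS
        }
        context.write(new Text(fileUrl), new Text(target.toString()));   // record what was copied
    }
}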
Maybe not the answer you were looking for, but hopefully helpful anyway :-).
About 2 years after my last answer, there are now two new alternatives - Hoop/HttpFS, and WebHDFS.
Regarding Hoop, it was first announced on Cloudera's blog and can be downloaded from a GitHub repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1; it can probably talk to slightly older versions as well.
If you're running Hadoop 0.23.1, which at the time of writing is still not released, Hoop is instead part of Hadoop as its own component, HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can be a proxy not only to HDFS, but also to other Hadoop-compatible filesystems such as Amazon S3.
Hoop/HttpFS runs as its own standalone service.
There's also WebHDFS which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0 and one of its major features is that it provides data locality - when you're making a read request, you will be redirected to the WebHDFS component on the datanode where the data resides.
Which component to choose depends a bit on your current setup and what needs you have. If you need an HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the GitHub repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that, unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limiting, etc.).
Install Cygwin, install Hadoop locally (you just need the binary and configs that point at your NN -- no need to actually run the services), run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/
You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.
There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.
It seems there is a dedicated page now for this at http://wiki.apache.org/hadoop/MountableHDFS:
These projects (enumerated below) allow HDFS to be mounted (on most flavors of Unix) as a standard file system using the mount command. Once mounted, the user can operate on an instance of hdfs using standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard Posix libraries like open, write, read, close from C, C++, Python, Ruby, Perl, Java, bash, etc.
Later it describes these projects
contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
hdfs-fuse - a Google Code project very similar to contrib/fuse-dfs
webdav - hdfs exposed as a webdav resource
mapR - contains a closed-source hdfs-compatible file system that supports read/write NFS access
HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs sequentially.
I haven't tried any of these, but I will update the answer soon, as I have the same need as the OP.
You can now also try to use Talend, which includes components for Hadoop integration.
You can try mounting HDFS on the machine (call it machine_X) where you are executing your code; machine_X should have InfiniBand connectivity with the HDFS cluster. Check this out: https://wiki.apache.org/hadoop/MountableHDFS
You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.
