Writing data to Hadoop

I need to write data into Hadoop (HDFS) from external sources such as a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and that there is an easy way to code external clients against HDFS.

There is an API in Java. You can use it by including the Hadoop code in your project.
The JavaDoc is quite helpful in general, but of course you have to know what you are looking for:
http://hadoop.apache.org/common/docs/
For your particular problem, have a look at:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
(this applies to the latest release, consult other JavaDocs for different versions!)
A typical call would be:
FileSystem.get(new JobConf()).create(new Path("however.file"));
This returns a stream that you can handle with regular Java I/O.

For the problem of loading the data I needed to put into HDFS, I chose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job where the mapper reads each file from the file server (in this case via https), then writes it directly to HDFS (via the Java API).
The list of files is read from the input. I then have an external script that populates a file with the list of files to fetch, uploads that file into HDFS (using hadoop dfs -put), then starts the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
Maybe not the answer you were looking for, but hopefully helpful anyway :-).

About 2 years after my last answer, there are now two new alternatives - Hoop/HttpFS, and WebHDFS.
Regarding Hoop, it was first announced in Cloudera's blog and can be downloaded from a github repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1; it can probably talk to slightly older versions as well.
If you're running Hadoop 0.23.1, which at the time of writing is still not released, Hoop is instead part of Hadoop as its own component, HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can be a proxy not only to HDFS, but also to other Hadoop-compatible filesystems such as Amazon S3.
Hoop/HttpFS runs as its own standalone service.
There's also WebHDFS which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0 and one of its major features is that it provides data locality - when you're making a read request, you will be redirected to the WebHDFS component on the datanode where the data resides.
Which component to choose depends a bit on your current setup and what needs you have. If you need an HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the github repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limitation, etc.).

Install Cygwin, install Hadoop locally (you just need the binary and configs that point at your NN -- no need to actually run the services), run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/
You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.
There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.

It seems there is a dedicated page now for this at http://wiki.apache.org/hadoop/MountableHDFS:
These projects (enumerated below) allow HDFS to be mounted (on most
flavors of Unix) as a standard file system using the mount command.
Once mounted, the user can operate on an instance of hdfs using
standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find',
'grep', or use standard Posix libraries like open, write, read, close
from C, C++, Python, Ruby, Perl, Java, bash, etc.
Later it describes these projects:
contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
hdfs-fuse - a google code project is very similar to contrib/fuse-dfs
webdav - hdfs exposed as a webdav resource
mapR - contains a closed source hdfs compatible file system that supports read/write NFS access
HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs sequentially.
I haven't tried any of these, but I will update the answer soon, as I have the same need as the OP.

You can now also try to use Talend, which includes components for Hadoop integration.

You can try mounting HDFS on the machine where you are executing your code (call it machine_X); machine_X should have InfiniBand connectivity with the HDFS cluster. Check this out: https://wiki.apache.org/hadoop/MountableHDFS

You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.

Related

Hadoop (HDFS) - file versioning

Currently I have a user file system in my application (Apache CMIS).
As it's growing bigger, I'm considering moving to Hadoop (HDFS), as we also need to run some statistics on it.
The problem:
The current file system provides versioning of the files.
When I read about Hadoop (HDFS) and file versioning, I found most of the time that I have to write this versioning layer myself.
Is there already something available to manage versioning of files in HDFS, or do I really have to write it myself? (I don't want to reinvent the wheel, but I can't find a proper solution either.)
Answer
For full details: see comments on answer(s) below
Hadoop (HDFS) doesn't support versioning of files. You can get this functionality when you combine Hadoop with (Amazon) S3:
Hadoop will use S3 as the filesystem (without chunks, but recovery will be provided by S3). This solution comes with the versioning of files that S3 provides.
Hadoop will still use YARN for the distributed processing.

Versioning is not possible with HDFS.
Instead you can use Amazon S3, which provides Versioning and is also compatible with Hadoop.

HDFS supports snapshots. I think that's as close as you can get to "versioning" with HDFS.

Hdfs put VS webhdfs

I'm loading a 28 GB file into HDFS using WebHDFS, and it takes ~25 minutes to load.
I tried loading the same file using hdfs put, and it took ~6 minutes. Why is there so much difference in performance?
Which is recommended? If somebody can explain or direct me to a good link, it would be really helpful.
Below is the command I'm using:
curl -i --negotiate -u: -X PUT "http://$hostname:$port/webhdfs/v1/$destination_file_location/$source_filename.temp?op=CREATE&overwrite=true"
This redirects to a DataNode address, which I use in the next step to write the data.
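For readers who want to script the same two-step flow without curl, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not the asker's actual setup: the function names, host, port, and paths are hypothetical, and it does not perform the Kerberos negotiation that curl's --negotiate flag handles, so it would only work against a cluster that accepts simple authentication.

```python
import http.client
import os
from urllib.parse import quote, urlencode, urlsplit


def create_url(host, port, hdfs_path, overwrite=True, user=None):
    """Build the step-1 WebHDFS CREATE URL for an absolute HDFS path."""
    params = {"op": "CREATE", "overwrite": "true" if overwrite else "false"}
    if user:
        # Assumption: simple (non-Kerberos) authentication is enabled.
        params["user.name"] = user
    return "http://%s:%d/webhdfs/v1%s?%s" % (
        host, port, quote(hdfs_path), urlencode(params))


def upload(host, port, hdfs_path, local_path, user=None):
    """Two-step upload: ask the NameNode where to write, then send the data there."""
    # Step 1: PUT with no body; the NameNode replies 307 with a Location header.
    first = urlsplit(create_url(host, port, hdfs_path, user=user))
    conn = http.client.HTTPConnection(first.netloc)
    conn.request("PUT", first.path + "?" + first.query)
    resp = conn.getresponse()
    resp.read()
    location = resp.getheader("Location")
    conn.close()
    if resp.status != 307 or not location:
        raise RuntimeError("expected a 307 redirect, got HTTP %d" % resp.status)

    # Step 2: PUT the file contents to the DataNode URL from the redirect.
    target = urlsplit(location)
    headers = {
        "Content-Type": "application/octet-stream",
        "Content-Length": str(os.path.getsize(local_path)),
    }
    conn = http.client.HTTPConnection(target.netloc)
    with open(local_path, "rb") as body:
        conn.request("PUT", target.path + "?" + target.query,
                     body=body, headers=headers)
        resp = conn.getresponse()
        resp.read()
    conn.close()
    if resp.status != 201:
        raise RuntimeError("upload failed with HTTP %d" % resp.status)
```

A call such as upload("namenode.example.com", 50070, "/tmp/big.dat", "big.dat") would mirror the curl sequence; the whole file still travels through a single HTTP connection to one DataNode.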

Hadoop provides several ways of accessing HDFS. All of the following support almost all features of the filesystem:
1. FileSystem (FS) shell commands: Provide easy access to Hadoop file system operations as well as to other file systems that Hadoop supports, such as Local FS, HFTP FS, and S3 FS. This requires a Hadoop client to be installed, and the client writes blocks directly to one DataNode. Not all versions of Hadoop support all options for copying between filesystems.
2. WebHDFS: Defines a public HTTP REST API that lets clients access Hadoop from multiple languages without installing Hadoop; the advantage is language-agnostic access (curl, PHP, etc.). WebHDFS needs access to all nodes of the cluster, and when data is read it is transmitted directly from the source node, but there is HTTP overhead compared with (1) the FS shell. On the other hand, it works agnostically and has no problems with different Hadoop clusters and versions.
3. HttpFS: Reads and writes data to HDFS in a cluster behind a firewall. A single node acts as a gateway through which all the data is transferred. Performance-wise I believe this can be even slower, but it is preferred when you need to pull data from a public source into a secured cluster.
So choose wisely! Going down the list will always be an alternative when the choice above it is not available to you.

Hadoop provides a FileSystem Shell API to support file system operations such as creating, renaming, or deleting files and directories, and opening, reading, or writing files.
The FileSystem shell is a Java application that uses the Java FileSystem class to provide file system operations. The FileSystem Shell API creates an RPC connection for these operations.
If the client is within the Hadoop cluster, this is useful because it uses the hdfs URI scheme to connect to the Hadoop distributed file system, so the client makes a direct RPC connection to write data into HDFS.
This is good for applications running within the Hadoop cluster, but there may be use cases where an external application needs to manipulate HDFS, for example to create directories and write files to them, or to read the content of a file stored on HDFS. Hortonworks developed an API based on standard REST functionality to support these requirements, called WebHDFS.
WebHDFS provides REST API functionality through which any external application can connect to the DistributedFileSystem over an HTTP connection, no matter whether the external application is written in Java or PHP.
The WebHDFS concept is based on HTTP operations like GET, PUT, POST and DELETE.
Operations like OPEN, GETFILESTATUS and LISTSTATUS use HTTP GET; others like CREATE, MKDIRS, RENAME and SETPERMISSION rely on HTTP PUT.
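To make the mapping from operation to HTTP verb concrete, here is a small sketch in Python (standard library only). The table and the build_request helper are illustrative names of my own, not part of any Hadoop client library, and the host and port in the usage note are placeholders. The permission operation is spelled SETPERMISSION in the REST API, and APPEND and DELETE are included as the POST and DELETE cases.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# HTTP method used by each WebHDFS operation listed above.
METHOD_FOR_OP = {
    "OPEN": "GET",
    "GETFILESTATUS": "GET",
    "LISTSTATUS": "GET",
    "CREATE": "PUT",
    "MKDIRS": "PUT",
    "RENAME": "PUT",
    "SETPERMISSION": "PUT",
    "APPEND": "POST",
    "DELETE": "DELETE",
}


def build_request(base_url, hdfs_path, op, **params):
    """Return an unsent urllib Request for one WebHDFS operation.

    base_url is the NameNode's HTTP address, hdfs_path an absolute HDFS
    path, and params any extra query parameters for the operation.
    """
    op = op.upper()
    query = urlencode(dict(op=op, **params))
    url = "%s/webhdfs/v1%s?%s" % (base_url.rstrip("/"), quote(hdfs_path), query)
    return Request(url, method=METHOD_FOR_OP[op])
```

For example, build_request("http://namenode.example.com:50070", "/user/alice", "LISTSTATUS") produces a GET request that could then be passed to urllib.request.urlopen against a reachable cluster.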
It provides secure read-write access to HDFS over HTTP. It is basically intended as a replacement for HFTP (read-only access over HTTP) and HSFTP (read-only access over HTTPS). It uses the webhdfs URI scheme to connect to the distributed file system.
If the client is outside the Hadoop cluster and trying to access HDFS, WebHDFS is useful. Also, if you are trying to connect two different versions of Hadoop clusters, WebHDFS is useful because it uses a REST API and so is independent of the MapReduce or HDFS version.

The difference between HDFS access and WebHDFS is scalability, due to the design of HDFS and the fact that an HDFS client decomposes a file into splits living on different nodes. When an HDFS client accesses file content, under the covers it goes to the NameNode and gets a list of file splits and their physical locations on the Hadoop cluster.
It can then go to the DataNodes at all those locations to fetch the blocks in the splits in parallel, piping the content directly to the client.
WebHDFS is a proxy living in the HDFS cluster and it layers on HDFS, so all data needs to be streamed to the proxy before it gets relayed on to the WebHDFS client. In essence it becomes a single point of access and an IO bottleneck.

You can use the traditional Java client API (which the hdfs Linux commands use internally).
From what I have read here, the Java client and the REST-based approach have similar performance.

Access hdfs from outside hadoop

I want to run some executables outside of hadoop (but on the same cluster) using input files that are stored inside HDFS.
Do these files need to be copied locally to the node? or is there a way to access HDFS outside of hadoop?
Any other suggestions on how to do this are fine. Unfortunately my executables can not be run within hadoop though.
Thanks!

There are a couple of typical ways:
You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for open. This will give you a stream that acts like a generic open file.
You can stream your data with hadoop cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could hypothetically create a bridge with this command line command with something like popen.
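The popen-style bridge mentioned above could look roughly like the following Python sketch. It assumes the hadoop command is on the PATH of the node running it; the function name and the base_cmd parameter are my own, with base_cmd included only so the command can be swapped out for testing.

```python
import subprocess


def hdfs_cat(hdfs_path, base_cmd=("hadoop", "fs", "-cat")):
    """Yield the lines that `hadoop fs -cat <hdfs_path>` writes to stdout."""
    cmd = list(base_cmd) + [hdfs_path]
    # No shell is involved, so the path is handed to the command unchanged.
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError("%s exited with status %d" % (cmd[0], proc.returncode))
```

A caller would simply loop over hdfs_cat("/path/to/file") and process each line, without copying the file to local disk first.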

Also check WebHDFS, which made it into the 1.0.0 release and will also be in the 0.23.1 release. Since it's based on a REST API, any language can access it, and Hadoop need not be installed on the node where the HDFS files are required. Also, it's as fast as the other options mentioned by orangeoctopus.

The best way is to install the "hadoop-0.20-native" package on the box where you are running your code.
The hadoop-0.20-native package can access the HDFS filesystem. It can act as an HDFS proxy.

I had a similar issue and asked a corresponding question. I needed to access HDFS / MapReduce services from outside the cluster. After I found a solution I posted an answer here for HDFS. The most painful issue turned out to be user authentication, which in my case was solved in the simplest way (the complete code is in my question).
If you need to minimize dependencies and don't want to install Hadoop on clients, here is a nice Cloudera article on how to configure Maven to build a JAR for this. 100% success in my case.
The main difference between remote MapReduce job posting and HDFS access is only one configuration setting (check the mapred.job.tracker variable).

How to save a file in MapR HDFS using Ruby

Is there a way to save a file in HDFS using MapR distribution of Hadoop from Ruby?
Apparently, there's a Thrift API called thriftfs that makes it possible to communicate with HDFS from clients but looks like it is not bundled with MapR.

I also answered this question at http://answers.mapr.com/questions/1525/how-to-run-thriftfs-from-mapr?page=1#1528
The basic idea is that languages like Ruby don't need language specific bindings to get access to the file system of a MapR cluster. Instead, all you need to do is mount the cluster as an NFS file system and you are good to go with any file access that you can dream up. This makes scripting in a Hadoop environment vastly easier.
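As a rough illustration of that point, once the cluster is mounted over NFS, saving a file is ordinary file I/O. The sketch below is in Python; Ruby's File.open would work the same way against the mount. The function name is my own, and the mount point in the usage note is a placeholder that depends on how the cluster was mounted.

```python
import os


def save_to_cluster(mount_root, relative_path, data):
    """Write bytes to a file under an NFS-mounted cluster directory."""
    target = os.path.join(mount_root, relative_path)
    # Create intermediate directories as on any local filesystem.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as out:
        out.write(data)
    return target
```

For example, save_to_cluster("/mapr/my.cluster.com", "user/alice/report.csv", b"a,b\n") would write the file into the cluster, assuming the cluster is mounted at that hypothetical path and the user has write permission.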

How to read a file from HDFS in a non-Java client

My MR job generates a report file, and an end user needs to be able to download that file by clicking a button on a normal web reporting interface. According to this O'Reilly book excerpt, there is an HTTP read-only interface. It says it's XML based, but it seems that it's simply the normal web interface intended to be viewed through a web browser, not something that can be programmatically queried, listed, and downloaded. Is my only recourse to write my own servlet-based interface? Or to execute the hadoop CLI tool?

The way to access HDFS programmatically from something other than Java is by using Thrift.
There are pre-generated client classes for several languages (Java, Python, PHP, ...) included in the HDFS source tree.
See http://wiki.apache.org/hadoop/HDFS-APIs

I'm afraid you will probably have to settle with the CLI AFAIK.
Not sure if it would fit your situation, but I think it would be reasonable to have whatever script that kicks off the MR job do a hadoop dfs -get ... after job completion to a known directory that's served.
Sorry that I don't know of an easier solution.
