Writing HDFS data to external disk/tape - hadoop

I have 1TB of data on HDFS. I don't have that much space on my local disk to copy that data locally.
Is there any way that I can write the HDFS data directly to an external hard disk?

If the disk is mapped on your machine, you should be able to do it using the -get command.

The external drive that you attached is just another local drive, an extension of your fixed hard drive(s). So you can use the copyToLocal option of the 'hadoop fs' command from the command line.
Here is the link for the details:
http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#copyToLocal
Additionally, the Hadoop APIs can be used to copy an HDFS file to a local drive. Refer to the copyToLocalFile() method documented below.
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#copyToLocalFile
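For example, a minimal Java sketch of the copyToLocalFile() approach might look like the following, assuming the external drive is mounted at a placeholder path such as /mnt/external; the NameNode URI and the HDFS source path are placeholders as well:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyHdfsToExternalDrive {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust the NameNode URI, the HDFS source path,
        // and the mount point of the external drive to your setup.
        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration());
        hdfs.copyToLocalFile(false,                            // keep the source on HDFS
                new Path("/user/me/big-dataset"),              // source directory on HDFS
                new Path("file:///mnt/external/big-dataset")); // destination on the mounted drive
        hdfs.close();
    }
}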

Related

Integrate local HDFS filesystem browser with IntelliJ IDEA

I studied the MapReduce paradigm using my university's HDFS cluster, accessing it through HUE. From HUE I am able to browse files, read/edit them, and so on.
So in that cluster I need:
a normal folder where I put the MapReduce.jar
access to the results in HDFS
I very much like writing MapReduce applications, so I have correctly configured a local HDFS as a personal playground, but for now I can access it only through the really time-consuming command line (commands such as those).
I can access the HDFS of my university "directly" through IntelliJ IDEA by means of an SFTP remote host connection; the following is the normal "user folder":
And here is the HDFS from HUE from which I get the results:
Obviously, on my local machine the normal "user folder" is where I am with the shell, but I can browse HDFS to get results only from the command line.
I wish I could do the same thing for my local HDFS. The following is the best I could do:
I know that it is possible to access HDFS at http://localhost:50070/explorer.html#/ but it is pretty clunky.
I looked for some plugins, but I did not find anything useful. Using the command line in the long run becomes tiring.
I can access "directly" to the HDFS of my thorough IntelliJ IDEA by the mean of SFTP remote host ...
Following is the best I could do...
Neither of those is HDFS.
The first is the user folder of the machine you SSH'd into.
The second is only the NameNode data directory on your local machine.
Hue uses WebHDFS, and connects through http://namenode:50070
What you would need is a plugin that can connect to the same API, which is not over SSH, or a simple file mount.
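As a rough illustration (not a plugin, just a sketch assuming the Hadoop client libraries are on the classpath), the same WebHDFS endpoint that Hue talks to can be reached programmatically through a webhdfs:// FileSystem; the host, port, and path below are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WebHdfsBrowse {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port: the NameNode HTTP endpoint, the same one Hue uses.
        FileSystem fs = FileSystem.get(new URI("webhdfs://localhost:50070"), new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/user/me"))) {  // placeholder HDFS path
            System.out.println(status.getPath() + (status.isDirectory() ? "/" : ""));
        }
        fs.close();
    }
}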
If you wanted a file mount, you need to set up an NFS Gateway, and then you mount the NFS drive like any other network-attached storage.
In Production environments, you would write your code, push it to Github, then Jenkins (for example) would build the code and push it to HDFS for you.

HBase export not copying to the local file system

I have some data in an HBase table and I have to take a backup of it. I am using version 0.94.18. I have used the following command for the export.
hbase org.apache.hadoop.hbase.mapreduce.Driver export hbasetable /home/user/backup/
What actually happened is that the data was copied to HDFS at exactly the path I gave. I expected it to be copied to my local file system, but it was not.
Where is the problem?
Second, how do I also back up the table schema in HBase?
For the first part of your question, take a look at How to copy Hbase data to local file system (external drive).
Since the data is already in HDFS, you just need to copy it from HDFS to the local file system.
As for the second part, the good old docs do the trick: http://hbase.apache.org/0.94/book/ops.backup.html
Basically they describe two solutions: either do the backup with the system offline, or use another cluster to hold a backup of your live system.
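For the schema part, one possible sketch (not taken from those docs; it assumes the standard 0.94 client API is available and uses the table name from the question) dumps the table descriptor so it can be saved alongside the exported data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class DumpTableSchema {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // "hbasetable" is the table name used in the question; adjust as needed.
        HTableDescriptor descriptor = admin.getTableDescriptor(Bytes.toBytes("hbasetable"));
        System.out.println(descriptor); // prints the column families and their settings; redirect to a file to keep it
        admin.close();
    }
}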

Pull a file from remote location (local file system in some remote machine) into Hadoop HDFS

I have files on a machine (say A) which is not part of the Hadoop (or HDFS) datacenter, so machine A is at a location remote from the HDFS datacenter.
Is there a script, command, program, or tool that can run on machines which are connected to Hadoop (part of the datacenter) and pull the files from machine A into HDFS directly? If yes, what is the best and fastest way to do this?
I know there are many ways, like WebHDFS and Talend, but they need to run from machine A, and the requirement is to avoid that and run it on the machines in the datacenter.
There are two ways to achieve this:
You can pull the data using scp and store it in a temporary location, then copy it to hdfs, and delete the temporarily stored data.
If you do not want to keep it as a two-step process, you can write a program which will read the files from the remote machine and write them to HDFS directly.
This question, along with its comments and answers, would come in handy for reading the file; meanwhile, you can use the snippet below to write to HDFS.
String outFile = "hdfs://<NameNode Host>:<port>/foo/bar/baz.txt"; // path of the new file on HDFS, including its name
FileSystem hdfs = FileSystem.get(new URI("hdfs://<NameNode Host>:<port>"), new Configuration());
Path newFilePath = new Path(outFile);
FSDataOutputStream out = hdfs.create(newFilePath);
byte[] buffer = new byte[50 * 1024];
int bytesRead;
// read from the remote input stream 'in' until EOF, writing each chunk to the HDFS file
while ((bytesRead = in.read(buffer)) != -1) {
    out.write(buffer, 0, bytesRead);
}
out.close();
A buffer of 50 * 1024 works if you have enough I/O capacity; otherwise use a much lower value like 10 * 1024.
Please tell me if I am understanding your question the right way:
1. You want to copy a file that is in a remote location.
2. The client machine is not a part of the Hadoop cluster.
3. It may not contain the required libraries for Hadoop.
The best way is WebHDFS, i.e., the REST API.
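For instance, here is a rough sketch of pushing one local file through the WebHDFS REST API using nothing but the JDK (no Hadoop libraries); the NameNode host/port, the user name, and both paths are placeholders:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WebHdfsUpload {
    public static void main(String[] args) throws Exception {
        // Placeholders: NameNode host/port, HDFS user, source file, and target path.
        String create = "http://namenode:50070/webhdfs/v1/user/me/data.bin"
                + "?op=CREATE&overwrite=true&user.name=me";

        // Step 1: ask the NameNode where to write; it answers with a redirect to a DataNode.
        HttpURLConnection nn = (HttpURLConnection) new URL(create).openConnection();
        nn.setRequestMethod("PUT");
        nn.setInstanceFollowRedirects(false);
        nn.connect();
        String dataNodeUrl = nn.getHeaderField("Location");
        nn.disconnect();

        // Step 2: stream the local file to the DataNode URL returned above.
        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dn.setRequestMethod("PUT");
        dn.setDoOutput(true);
        try (InputStream in = Files.newInputStream(Paths.get("/path/on/machine-A/data.bin"));
             OutputStream out = dn.getOutputStream()) {
            byte[] buffer = new byte[64 * 1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        System.out.println("HTTP " + dn.getResponseCode()); // expect 201 Created
        dn.disconnect();
    }
}

The same two-step pattern (contact the NameNode first, then follow its redirect to a DataNode) also applies to the OPEN and APPEND operations.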

Accessing the local filesystem without uploading to HDFS

Is there any way to specify an input path in Hadoop that is outside HDFS? I am running a single-node cluster and want to access files outside HDFS, so is there any way to do this?
Yes. Just give the complete path of your file on the local FS. Don't forget to add the "file://" prefix. To be on the safer side, don't add a reference to the config file in your code, if you have done so.
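For example, a hypothetical job setup (the class and path names here are made up for illustration) could point the input straight at the local filesystem:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "local-input-example");
        job.setJarByClass(LocalInputJob.class);
        // Input is read straight from the local filesystem; note the file:// scheme.
        FileInputFormat.addInputPath(job, new Path("file:///home/user/input"));
        // Output can still go to HDFS (or to a file:// path as well).
        FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}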

How to use a third-party API in Hadoop to read files from HDFS if that API accepts only a local file system path?

I have large mbox files and I am using a third-party API like mstor to parse messages from the mbox files using Hadoop. I have uploaded those files to HDFS. But the problem is that this API accepts only a local file system path, similar to what is shown below:
MessageStoreApi store = new MessageStoreApi("file location on the local file system");
I could not find a constructor in this API that initializes from a stream, so I cannot read an HDFS stream and use it for initialization.
Now my question is: should I copy my files from HDFS to the local file system and initialize the store from a local temporary folder? That is what I have been doing for now:
Currently my map function receives the path of the mbox file.
// key = path of the mbox file in HDFS, value = null
map(key, value) {
    String localTempFile = copyToLocalFile(key); // copy the file from HDFS to a local temp path
    MessageStoreApi store = new MessageStoreApi(localTempFile); // the API only accepts a local path
    // process file
}
Or is there some other solution? I am hoping for something like this: what if I increase the block size so that a single file fits in one block, and somehow get the locations of those blocks in my map function? Since map tasks mostly execute on the same node where the blocks are stored, I might not always have to download to the local file system. But I am not sure if that will always work :)
Suggestions and comments are welcome!
For local filesystem path-like access, HDFS offers two options: HDFS NFS (via NFSv3 mounts) and FUSE-mounted HDFS.
The former is documented under the Apache Hadoop docs (CDH users may follow this instead)
The latter is documented at the Apache Hadoop wiki (CDH users may find relevant docs here instead)
The NFS feature is currently better maintained upstream than the FUSE option.
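Once such a mount is in place, the third-party API can simply be handed an ordinary local path. A rough sketch, assuming a hypothetical mount point /mnt/hdfs and reusing the MessageStoreApi placeholder from the question:

// /mnt/hdfs is a hypothetical NFS (or FUSE) mount point for HDFS,
// and MessageStoreApi is the placeholder name from the question.
String mboxOnMountedHdfs = "/mnt/hdfs/user/me/mail/archive.mbox";
MessageStoreApi store = new MessageStoreApi(mboxOnMountedHdfs);
// process the messages as usual; no copy to a local temp folder is needed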
