Copying directories in HDFS using the Java API - hadoop

How do I copy a directory in HDFS to another directory in HDFS?
I found the copyFromLocalFile functions that copy from the local FS to HDFS, but I want both of the source/destination to be in HDFS.
Thanks

Use the distcp command.
The canonical use case for distcp is for transferring data between two HDFS clusters.
If the clusters are running identical versions of Hadoop, the hdfs scheme is
appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
If you want to do it through Java code, see class org.apache.hadoop.tools.DistCp and call it appropriately.
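For example, here is a rough, untested sketch of invoking DistCp programmatically, assuming a Hadoop 2.x client (where DistCpOptions still has a (source paths, target) constructor) and reusing the placeholder hdfs:// paths from the command above:
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

Configuration conf = new Configuration();
Path src = new Path("hdfs://namenode1/foo");   // placeholder source
Path dst = new Path("hdfs://namenode2/bar");   // placeholder destination
// Build the options DistCp would otherwise parse from the command line.
DistCpOptions options = new DistCpOptions(Collections.singletonList(src), dst);
DistCp distCp = new DistCp(conf, options);
distCp.execute();   // submits the copy as a MapReduce job and waits for it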

You can try FileUtil.copy
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileUtil.html
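For instance, a minimal sketch (with placeholder directory paths) that copies one HDFS directory into another using FileUtil.copy:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml from the classpath
FileSystem fs = FileSystem.get(conf);            // the default (HDFS) filesystem
Path src = new Path("/user/hadoop/source-dir");  // placeholder source directory
Path dst = new Path("/user/hadoop/target-dir");  // placeholder target directory
// Recursively copies src into dst; 'false' means the source is not deleted.
boolean ok = FileUtil.copy(fs, src, fs, dst, false, conf);
System.out.println(ok ? "copy succeeded" : "copy failed");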

Related

How to copy a file from a GCS bucket in Dataproc to HDFS using google cloud?

I have uploaded the data file to the GCS bucket of my project in Dataproc. Now I want to copy that file to HDFS. How can I do that?
For a single "small" file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs dfs -cp command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
Note that GCS objects use the gs: scheme. Paths should appear the same as they do when you use gsutil.
For a "large" file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consult the DistCp documentation for details.
Consider leaving data on GCS
Finally, consider leaving your data on GCS. Because the GCS connector implements Hadoop's distributed filesystem interface, it can be used as a drop-in replacement for HDFS in most cases. Notable exceptions are when you rely on (most) atomic file/directory operations or want to use a latency-sensitive application like HBase. The Dataproc HDFS migration guide gives a good overview of data migration.
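Because the connector exposes gs:// paths through the standard Hadoop FileSystem API, code written against hdfs:// paths usually works unchanged. As a small illustration (the bucket and object names below are hypothetical), this sketch reads a GCS object from a Dataproc node:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
Path gcsPath = new Path("gs://my-bucket/data/input.txt");  // hypothetical bucket/object
FileSystem gcsFs = gcsPath.getFileSystem(conf);            // resolves to the GCS connector on Dataproc
try (FSDataInputStream in = gcsFs.open(gcsPath);
     BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
    System.out.println(reader.readLine());                 // print the first line, just to show access
}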

hadoop on windows, how to add D:\folder1 and E:\folder1 to hdfs?

hadoop fs -put popularNames.txt /user/hadoop/dir1/popularNames.txt
My folders are very large, about 3 TB.
I don't want to copy the folders; I want to point HDFS at their existing locations. How can I do that?
HDFS: Hadoop Distributed File System.
You can't add a link that points to a local location, because the data must actually be present in HDFS (not on your local disks). The whole point of using Hadoop is distributed computation, which is only possible when your data is distributed across the cluster.
hadoop fs -put has to be used to copy the files from your local filesystem into HDFS before the Hadoop framework can work with them.

Hadoop distcp not working

I am trying to copy data from one HDFS cluster to another. Any suggestions as to why the 1st command works but the 2nd does not?
(works)
hadoop distcp hdfs://abc.net:8020/foo/bar webhdfs://def.net:14000/bar/foo
(does not work)
hadoop distcp webhdfs://abc.net:50070/foo/bar webhdfs://def:14000/bar/foo
Thanks!
If the two clusters are running incompatible versions of HDFS, you can
use the webhdfs protocol to run distcp between them:
hadoop distcp webhdfs://namenode1:50070/source/dir webhdfs://namenode2:50070/destination/dir
When using webhdfs, the NameNode URI and the NameNode's HTTP port should be provided for both the source and the destination.

How to move Word and PDF documents to Hadoop HDFS?

I want to copy/upload some files from a local system (a system not in the Hadoop cluster) onto Hadoop HDFS. The local system can be a Windows system too.
I tried with a Flume spool directory. It works fine with text files. For other docs, the mime type gets corrupted.
Please let me know about the different approaches to load file(s) into HDFS.
hadoop fs -copyFromLocal <localsrc> URI
Check Hadoop documentation: copyFromLocal
Keep in mind that Apache Flume wasn't designed simply for copying files around.
You can also use hadoop fs -put <localsrcpath> <hdfspath>
This is an alternative to copyFromLocal.
In Hadoop 2.0 (YARN) you can do the following to transfer local files to HDFS:
hdfs dfs -put "localsrcpath" "hdfspath"
where hdfs is the command located in the bin directory.
Java code can do this easily; you don't need any extra tools. Check the piece of code below, which worked for me:
Configuration conf = new Configuration();
try {
    conf.set("fs.defaultFS", "<<namenode>>"); // something like hdfs://server:9000, or copy it from core-site.xml
    FileSystem fileSystem = FileSystem.get(conf);
    System.out.println("Uploading, please wait...");
    // args[0] = local source, e.g. C://file or dir; args[1] = HDFS target, e.g. /imported
    fileSystem.copyFromLocalFile(false, new Path(args[0]), new Path(args[1].trim()));
} catch (IOException e) {
    e.printStackTrace();
}
Package this into a jar and run it on any OS. Keep in mind that you do
not need Hadoop running on the machine where you run this code. If you
need any help, add comments.
Don't forget to add name-resolution entries on the machine where you run this code. Open the hosts file (on Windows: C:\Windows\System32\drivers\etc\hosts) and add lines like:
ip-address hadoopnamenode
ip-address slavenode
First you need to transfer the docs from your Windows machine to a Linux machine using FileZilla or another tool.
And then you need to use:
hadoop fs -put localsrcpath hdfspath
The following command will also work:
hadoop fs -copyFromLocal localsrcpath hdfspath

Explanation of the hadoop file system

Can anyone help me understand the data storage concept of Hadoop?
As I understand it, Hadoop deals with the fsimage and data blocks, and the fsimage and edit log paths are configured in hdfs-site.xml. But what about the data blocks? Can anyone help me with this? I am a little bit confused about where the /user and /tmp directories are actually present in the filesystem.
I used this link to set up a single node hadoop cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Files are split into blocks and stored in the Hadoop Distributed File System (HDFS). Consult the HDFS module of Yahoo's Hadoop Tutorial for a description of HDFS. The directories stored in HDFS can be viewed by typing the following command into a terminal: hadoop dfs -ls
The NameNode's fsimage keeps track of the filesystem namespace and of which blocks make up which files; the block-to-DataNode mapping is rebuilt from the block reports the DataNodes send. In the hdfs-site.xml file, the 'dfs.data.dir' property defines where the DataNode stores the underlying block files on its local filesystem. This can be a comma-separated list of directories (think multiple disks).
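For example, a hypothetical hdfs-site.xml entry that spreads block storage across two disks (the paths below are made up):
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
</property>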
