I have a directory structure with data on a local filesystem, and I need to replicate it to a Hadoop cluster.
So far I have found three ways to do it:
using the "hdfs dfs -put" command
using the HDFS NFS gateway
mounting my local dir via NFS on each datanode and using distcp
Am I missing any other tools? Which one of these would be the fastest way to make a copy?
I think hdfs dfs -put or hdfs dfs -copyFromLocal would be the simplest way of doing it.
If you have a lot of data (many files), you can copy them programmatically with the FileSystem API:
// copies the local directory (recursively) into the given HDFS directory
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("/home/me/localdirectory/"), new Path("/me/hadoop/hdfsdir"));
I want to make a folder in hadoop-2.7.3 that physically resides on an external (USB thumb) drive, the idea being that any file I -copyFromLocal will reside on the thumb drive. Similarly, any output files from Hadoop should also go to the external drive:
mkdir /media/usb
mount /dev/sdb1 /media/usb
hdfs dfs -mkdir /media/usb/test
hdfs dfs -copyFromLocal /media/source/input.data /media/usb/test
hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input /media/usb/test/input.data \
-output /media/usb/test/output.data
But I get a "no such file or directory" error when trying to make the folder above. It only works if I make the folders local to Hadoop:
hdfs dfs -mkdir /test
hdfs dfs -copyFromLocal /media/source/input.data /test
Unfortunately this places the input data file on the same drive as the Hadoop install, which is nearly full. Is there a way to make/map an HDFS folder so that it reads/writes from a drive other than the Hadoop drive?
What you are trying to do is not possible! It defies the whole idea of distributed storage and processing.
When you do a copyFromLocal, the file goes from your local filesystem to an HDFS location (which Hadoop manages). You may add your new drive as an HDFS DataNode, but you cannot mandate that a particular file move to it.
If space is your only constraint, then add the new drive as a DataNode and rebalance the cluster.
Once the new node is added and the DataNode service is started on it, balance the cluster using:
hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-idleiterations <idleiterations>]
Refer to the HDFS Balancer documentation for details.
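If you want to confirm how much room the rebalance freed up, here is a minimal sketch using the FileSystem API (assuming a reachable cluster and the Hadoop client libraries on the classpath; the class name is just an example) that reports overall HDFS capacity and usage:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class HdfsUsage {
    public static void main(String[] args) throws Exception {
        // picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // aggregate capacity/used/remaining across the whole file system
        FsStatus status = fs.getStatus();
        System.out.printf("capacity=%d used=%d remaining=%d%n",
                status.getCapacity(), status.getUsed(), status.getRemaining());
        fs.close();
    }
}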
I would like to use distcp to copy a list of files (> 1K files) into HDFS. I have already stored the list of files in a local directory. Can I use -f to copy all the files? If yes, what format do I have to maintain in my file list? Or is there a better way?
You don't have to use distcp if your use case is copying data from the local filesystem (say, Linux) to HDFS. You can simply use the hdfs dfs -put command. Here is the syntax:
hdfs dfs -put /path/to/local/dir/* /path/on/hdfs/
e.g.
hdfs dfs -mkdir /user/hduser/destination-dir/
hdfs dfs -put /home/abc/mydir/* /user/hduser/destination-dir/
You have created a file containing a list of file paths, but that is not needed at all. A file list like that is mainly used (with distcp) when you are copying data from one cluster to another.
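That said, if you still want to drive the copy from your existing file list, a small FileSystem-API sketch like the following would do it without distcp. This assumes the list is a plain text file with one absolute local path per line (that format, the class name, and the arguments are assumptions here):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFileList {
    public static void main(String[] args) throws Exception {
        // args[0]: local text file with one local path per line
        // args[1]: destination directory on HDFS, e.g. /path/on/hdfs/
        List<String> localPaths = Files.readAllLines(Paths.get(args[0]));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dest = new Path(args[1]);
        fs.mkdirs(dest);

        for (String local : localPaths) {
            if (local.trim().isEmpty()) continue;
            // delSrc=false, overwrite=true: keep the local copy, replace on HDFS if present
            fs.copyFromLocalFile(false, true, new Path(local.trim()), dest);
        }
        fs.close();
    }
}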
I am a beginner in Hadoop. I have two questions:
1) How do I access files stored in HDFS? Is it the same as using a FileReader from java.io and giving the local path, or is it something else?
2) I have created a folder where I copied the file to be stored in HDFS and the jar file of the MapReduce program. When I run the following command in any directory,
${HADOOP_HOME}/bin/hadoop dfs -ls
it just shows me all the files in the current directory. So does that mean all the files got added without me explicitly adding them?
Yes, it's pretty much the same idea; you just go through Hadoop's FileSystem API instead of java.io directly.
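For completeness, here is a rough sketch of what reading looks like (the path and class name are placeholders): you open a stream with FileSystem.open instead of constructing a java.io.FileReader, and from there on it is ordinary stream handling:
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open an HDFS path; the returned stream is a normal InputStream
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/path/in/HDFS/file.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}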
You should keep in mind that HDFS is different from your local file system. With hadoop dfs you access HDFS, not the local file system. So hadoop dfs -ls /path/in/HDFS shows you the contents of the /path/in/HDFS directory, not the local one; the output is the same no matter where you run the command from.
If you want to "upload" / "download" files to/from HDFS, you should use the commands:
hadoop dfs -copyFromLocal /local/path /path/in/HDFS and
hadoop dfs -copyToLocal /path/in/HDFS /local/path, respectively.
I have an Elastic MapReduce job which writes some files to S3, and I want to concatenate all the files to produce a single text file.
Currently I'm manually copying the folder with all the files to our HDFS (hadoop fs -copyFromLocal), then I'm running hadoop fs -getmerge and hadoop fs -copyToLocal to obtain the file.
Is there any way to use hadoop fs directly on S3?
Actually, the suggestion that getmerge works directly against S3 is incorrect. getmerge expects a local destination and will not work with S3. If you try, it throws an IOException and responds with -getmerge: Wrong FS:.
Usage:
hadoop fs [generic options] -getmerge [-nl] <src> <localdst>
An easy way (if you are generating a small file that fits on the master machine) is to do the following:
Merge the file parts into a single file on the local machine (Documentation):
hadoop fs -getmerge hdfs://[FILE] [LOCAL FILE]
Copy the result file to S3, and then delete the local file (Documentation)
hadoop dfs -moveFromLocal [LOCAL FILE] s3n://bucket/key/of/file
I haven't personally tried the getmerge command, but hadoop fs commands on EMR cluster nodes support S3 paths just like HDFS paths. For example, you can SSH into the master node of your cluster and run:
hadoop fs -ls s3://<my_bucket>/<my_dir>/
The above command will list out all the S3 objects under the specified directory path.
I would expect hadoop fs -getmerge to work the same way. So, just use full S3 paths (starting with s3://) instead of HDFS paths.
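If the shell route keeps fighting you, another option on Hadoop 2.x is to do the merge programmatically with FileUtil.copyMerge (note that this method was removed in Hadoop 3). A rough sketch with placeholder host names, paths, and class name, assuming the relevant file systems (hdfs://, s3n:// or s3a://) are configured:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // source: the directory of part-* files; destination: the single merged file
        Path srcDir = new Path("hdfs://namenode:8020/job/output/");
        Path dstFile = new Path("hdfs://namenode:8020/job/merged.txt");

        FileSystem srcFs = FileSystem.get(srcDir.toUri(), conf);
        FileSystem dstFs = FileSystem.get(dstFile.toUri(), conf);

        // concatenate every file under srcDir into dstFile;
        // deleteSource=false keeps the parts, "\n" is appended after each part
        FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, false, conf, "\n");
    }
}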
Can anyone let me know what seems to be wrong here? The hadoop dfs command seems to be OK, but any following options are not recognized.
[hadoop-0.20]$bin/hadoop dfs -ls ~/wordcount/input/
ls: Cannot access /home/cloudera/wordcount/input/ : No such file or directory
hadoop fs -ls /some/path/here will list an HDFS location, not your local Linux location.
Try this command first:
hadoop fs -ls /
then investigate the other folders step by step.
If you want to copy some files from a local directory to a users directory on HDFS, just use this:
hadoop fs -mkdir /users
hadoop fs -put /some/local/file /users
For more HDFS commands, see: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
hadoop fs refers to a generic file system, which can point to any supported file system such as the local filesystem, HDFS, or S3, whereas hadoop dfs is specific to HDFS. So with fs the operation can go from/to the local filesystem, HDFS, or any other configured file system, while a dfs operation always relates to HDFS.
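To illustrate, the Java FileSystem API makes the same distinction explicit: the scheme in the URI decides which implementation you get back (the namenode host/port below is a placeholder):
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // the generic entry point: the URI scheme selects the file system implementation
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

        System.out.println(local.getClass().getSimpleName()); // e.g. LocalFileSystem
        System.out.println(hdfs.getClass().getSimpleName());  // e.g. DistributedFileSystem
    }
}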