I want to create a folder in hadoop-2.7.3 that physically resides on an external (USB thumb) drive, the idea being that any file I -copyFromLocal will reside on the thumb drive. Similarly, any output files from Hadoop should also go to the external drive:
mkdir /media/usb
mount /dev/sdb1 /media/usb
hdfs dfs -mkdir /media/usb/test
hdfs dfs -copyFromLocal /media/source/input.data /media/usb/test
hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input /media/usb/test/input.data \
-output /media/usb/test/output.data
But I get a "no such file or directory" error when trying to make the folder above. It works only if I make the folders local to the Hadoop install:
hdfs dfs -mkdir /test
hdfs dfs -copyFromLocal /media/source/input.data /test
Unfortunately this places the input data file onto the same drive as the Hadoop install, which is nearly full. Is there a way to make/map an HDFS folder so that it reads/writes from a drive other than the Hadoop drive?
What you are trying to do is not possible! It defies the whole idea of distributed storage and processing.
When you do a copyFromLocal, the file goes from your local file system onto an HDFS location (which Hadoop manages). You may add your new drive as an HDFS DataNode, but you cannot force a particular file to be stored on it.
If space is your only constraint, then add the new drive as a DataNode and re-balance the cluster (a single-node alternative is sketched after the balancer usage below).
Once the new node is added and the DataNode service is started on it, balance the cluster using:
hdfs balancer
[-threshold <threshold>]
[-policy <policy>]
[-exclude [-f <hosts-file> | <comma-separated list of hosts>]]
[-include [-f <hosts-file> | <comma-separated list of hosts>]]
[-idleiterations <idleiterations>]
Refer: HDFS Balancer
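If this is a single-node install rather than a multi-node cluster, a common alternative is to give the existing DataNode the new drive as an additional storage directory: list the mount point in dfs.datanode.data.dir in hdfs-site.xml and restart the DataNode. The first path below is only a placeholder for whatever data directory your install already uses:
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- keep the current data directory, add the USB mount as a second one -->
  <value>/usr/local/hadoop/hdfs/datanode,/media/usb/hdfs/datanode</value>
</property>
New blocks are then spread across both directories, but you still cannot pin a particular file to the USB drive.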
I have created a file in HDFS using the command below:
hdfs dfs -touchz /hadoop/dir1/file1.txt
I could see the created file by using the command below:
hdfs dfs -ls /hadoop/dir1/
But I could not find the file itself using Linux commands (find or locate). I searched on the internet and found the following link: How to access files in Hadoop HDFS?. It says HDFS is virtual storage. In that case, how does it decide which partition to use and how much of it, and where is the metadata stored?
Is it using the DataNode location I specified in hdfs-site.xml as the virtual storage for all the data?
I looked into the DataNode location and there are files there, but I could not find anything related to the file or folder I created.
(I am using hadoop 2.6.0)
The HDFS file system is a distributed storage system wherein the storage location is virtual, created using the disk space from all the DataNodes. While installing Hadoop, you must have specified paths for dfs.namenode.name.dir and dfs.datanode.data.dir. These are the locations at which all the HDFS-related files are stored on the individual nodes.
While storing data on HDFS, it is stored as blocks of a specified size (default 128 MB in Hadoop 2.x). When you use hdfs dfs commands you will see the complete files, but internally HDFS stores these files as blocks. If you check the above-mentioned paths on your local file system, you will see a bunch of files which correspond to the files on your HDFS. But again, you will not see them as the actual files, as they are split into blocks.
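If you want to see which blocks make up a particular file and which DataNodes hold them, hdfs fsck can print that mapping (using the file from the question as an example):
hdfs fsck /hadoop/dir1/file1.txt -files -blocks -locations
The block IDs it reports (blk_...) are the names of the block files you will find under the dfs.datanode.data.dir path.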
Check the output of the commands below for more details on how much space from each DataNode is used to create the virtual HDFS storage:
hdfs dfsadmin -report #Or
sudo -u hdfs hdfs dfsadmin -report
HTH
Just as we create a file in the local file system (LFS), first make a directory and a file in it, for example:
$ mkdir MITHUN90
$ cd MITHUN90
$ nano file1.log
Now create a directory in HDFS, for example: hdfs dfs -mkdir /mike90. Here "mike90" is the directory name. After creating the directory, send the file from the LFS to HDFS using this command:
$ hdfs dfs -copyFromLocal /home/gopalkrishna/file1.log /mike90
Here '/home/gopalkrishna/file1.log' is the path of the file in the local file system and '/mike90' is the directory in HDFS. Running $ hdfs dfs -ls /mike90 then lists the files in that HDFS directory.
I have a directory structure with data on a local filesystem. I need to replicate it to a Hadoop cluster.
For now I found three ways to do it:
using "hdfs dfs -put" command
using hdfs nfs gateway
mounting my local dir via nfs on each datanode and using distcp
Am I missing any other tools? Which one of these would be the fastest way to make a copy?
I think hdfs dfs -put or hdfs dfs -copyFromLocal would be the simplest way of doing it.
If you have a lot of data (many files), you can copy them programmatically:
// Needs org.apache.hadoop.conf.Configuration, fs.FileSystem and fs.Path imported
Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("/home/me/localdirectory/"), new Path("/me/hadoop/hdfsdir"));
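As for the distcp option mentioned in the question: distcp runs as a MapReduce job, so a file:// source only works if every node can see the directory at the same path, which is exactly why the NFS mount is needed. Under that assumption, the invocation might look like this (both paths are placeholders):
hadoop distcp file:///mnt/nfs/exported_dir /user/me/target_dir
For a directory that fits comfortably on one machine, though, plain -put is usually simpler and fast enough.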
I am a beginner in Hadoop. I have two doubts:
1) How do I access files stored in HDFS? Is it the same as using a FileReader from java.io and giving the local path, or is it something else?
2) I have created a folder where I copied the file to be stored in HDFS and the jar file of the MapReduce program. When I run the following command in any directory:
${HADOOP_HOME}/bin/hadoop dfs -ls
it just shows me all the files in the current directory. So does that mean all the files got added without me explicitly adding them?
Yes, it's pretty much the same. Read this post to read files from HDFS.
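In outline it is the same pattern as java.io, except you open the stream through the HDFS FileSystem object instead of a FileReader. A minimal sketch (the path is illustrative):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS
        Path file = new Path("/test/input.data");   // an HDFS path, not a local one
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}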
You should keep in mind that HDFS is different than your local file system. With hadoop dfs you access the HDFS, not the local file system. So, hadoop dfs -ls /path/in/HDFS shows you the contents of the /path/in/HDFS directory, not the local one. That's why it's the same, no matter where you run it from.
If you want to "upload" / "download" files to/from HDFS, you should use the commands:
hadoop dfs -copyFromLocal /local/path /path/in/HDFS and
hadoop dfs -copyToLocal /path/in/HDFS /local/path, respectively.
I have an Elastic Map Reduce job which is writing some files in S3 and I want to concatenate all the files to produce a unique text file.
Currently I'm manually copying the folder with all the files to our HDFS (hadoop fs -copyFromLocal), then I'm running hadoop fs -getmerge and hadoop fs -copyToLocal to obtain the file.
Is there any way to use hadoop fs directly on S3?
Actually, this response about getmerge is incorrect. getmerge expects a local destination and will not work with S3. It throws an IOException if you try, responding with -getmerge: Wrong FS:.
Usage:
hadoop fs [generic options] -getmerge [-nl] <src> <localdst>
An easy way (if you are generating a small file that fits on the master machine) is to do the following:
Merge the file parts into a single file on the local machine (Documentation)
hadoop fs -getmerge hdfs://[FILE] [LOCAL FILE]
Copy the result file to S3, and then delete the local file (Documentation)
hadoop dfs -moveFromLocal [LOCAL FILE] s3n://bucket/key/of/file
I haven't tried the getmerge command myself, but hadoop fs commands on EMR cluster nodes support S3 paths just like HDFS paths. For example, you can SSH into the master node of your cluster and run:
hadoop fs -ls s3://<my_bucket>/<my_dir>/
The above command will list out all the S3 objects under the specified directory path.
I would expect hadoop fs -getmerge to work the same way. So, just use full S3 paths (starting with s3://) instead of HDFS paths.
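If that holds in your setup, the manual HDFS round trip collapses to something like this (bucket and key names are placeholders):
hadoop fs -getmerge s3://<my_bucket>/<job_output_dir>/ /tmp/merged.txt
hadoop fs -put /tmp/merged.txt s3://<my_bucket>/merged.txt
getmerge still needs a local destination (as noted above), so the merged file lands on the master node before being pushed back to S3.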
Can anyone let me know what seems to be wrong here? The hadoop dfs command seems to be OK, but any options that follow are not recognized.
[hadoop-0.20]$bin/hadoop dfs -ls ~/wordcount/input/
ls: Cannot access /home/cloudera/wordcount/input/ : No such file or directory
hadoop fs -ls /some/path/here will list an HDFS location, not your local Linux location.
First try this command:
hadoop fs -ls /
then investigate the other folders step by step.
If you want to copy some files from a local directory to a users directory on HDFS, just use this:
hadoop fs -mkdir /users
hadoop fs -put /some/local/file /users
For more HDFS commands see: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
fs relates to a generic file system which can point to any file system, such as the local file system, HDFS, S3, etc., whereas dfs is specific to HDFS. So when we use fs, the operation can go from/to the local file system or HDFS, but a dfs operation always relates to HDFS.
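For example (the NameNode host is a placeholder):
hadoop fs -ls file:///tmp                # generic fs shell against the local file system
hadoop fs -ls hdfs://namenode:8020/user  # the same shell against HDFS, addressed explicitly
hdfs dfs -ls /user                       # dfs talks to HDFS (the configured default file system)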