I'm sorry if this is a rather simple question, but I haven't found an exact answer online and just need a quick confirmation.
I am trying to copy files from one HDFS directory to a new directory to make a backup. I was given something like this:
hadoop fs -mkdir one/two/three/dir1_bkp
hadoop fs -cp one/two/three/dir1/* one/two/three/dir1_bkp
This should just copy all of the files in dir1 into dir1_bkp and not affect anything in dir1, correct?
Copying doesn't affect the source location, no.
Depending on the size of the data, distcp might be a better option.
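For a backup like the one above, a distcp run would look something like this sketch (same relative paths as the question, and assuming dir1_bkp does not exist yet; if it already exists, distcp nests dir1 underneath it instead):
hadoop distcp one/two/three/dir1 one/two/three/dir1_bkp
distcp launches a MapReduce job, so it mainly pays off for large directories.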
Related
I was running hadoop distcp to copy a whole directory (500GB+) from /path/to/source to /path/to/destination. However, instead of running
$ hadoop distcp /path/to/source /path/to/destination
I did the following by mistake
$ hadoop distcp /path/to/source path/to/destination
The operation completed like a normal distcp copy, with the MapReduce job taking some time to run, but of course I did not get my data in /path/to/destination. It was also not in /path/to/source/path/to/destination, or in any other relative path I could think of.
Where did the data go? Thanks.
The data doesn't go anywhere; if the destination path is not correct, it stays in the source location.
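One more place worth checking: relative paths in HDFS commands are resolved against the user's home directory, so a destination written as path/to/destination would normally land under /user/<your-username> (the username below is just a placeholder):
hadoop fs -ls /user/<your-username>/path/to/destination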
I need to move files written by a Hive job that look like this
/foo/0000_0
/foo/0000_1
/bar/0000_0
into a file structure that looks like this
/foo/prefix1/prefix2-0000_0
/foo/prefix1/prefix2-0000_1
/bar/prefix1/prefix2-0000_0
before migrating this out of the cluster (using s3distcp). I've been looking through the hadoop fs commands but I can't find anything that would let me do this. I don't want to rename the files one by one.
First, you need to create the subdirectory inside /foo. For this, use the following command:
$ hdfs dfs -mkdir /foo/prefix1
This will create a subdirectory in /foo. If you want to create more subdirectories inside prefix1, use the same command again with the updated path. If you are using an older version of Hadoop (1.x), replace hdfs with hadoop.
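As a side note, Hadoop 2.x and later also accept a -p flag (and several paths at once) on mkdir, so both target directories from the question can be created in one step, parent directories included:
$ hdfs dfs -mkdir -p /foo/prefix1 /bar/prefix1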
Now you can move files from /foo to /foo/prefix1 using the following command. Here newfilename can be any name you want to give to your file:
$ hdfs dfs -mv /foo/filename /foo/prefix1/newfilename
Hope this answers your query.
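If you would rather not issue one -mv per file, a small shell loop over the directory listing can do the batch rename. This is only a sketch: it assumes a recent Hadoop (the -C option of -ls, which prints bare paths, is not in 1.x) and that prefix2- is the literal prefix you want on the new names:
for f in $(hdfs dfs -ls -C /foo); do
  [ "$f" = "/foo/prefix1" ] && continue   # skip the target directory itself
  hdfs dfs -mv "$f" "/foo/prefix1/prefix2-$(basename "$f")"
done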
I learned that if you want to copy multiple files from one Hadoop folder to another, it is better to create one big 'hdfs dfs -cp' statement with many source arguments instead of several separate hdfs dfs -cp statements.
By 'better' I mean that it improves the overall time it takes to copy the files: one command is quicker than several separate -cp commands run one after the other.
When I do this and the target directory is the same for all the files I want to copy, I get a warning.
I'm executing the following command:
hdfs dfs -cp -f /path1/file1 /pathx/target /path2/file2 /pathx/target /path3/file3 /pathx/target
After executing it I get the following warning returned:
cp: `/pathx/target' to `/pathx/target/target': is a subdirectory of itself
Although I get this weird warning, the copy itself succeeds as it should.
Is this a bug or am I missing something?
Try the following syntax instead; -cp takes a list of source files followed by a single destination, so the target directory should appear only once (repeating it, as in the command above, makes cp try to copy /pathx/target into itself, which is what triggers the warning):
hadoop fs -cp /path1/file1 /path2/file2 /path3/file3 /pathx/target
Or, if the files all live under the same directory, you can let the local shell's brace expansion build the list (no spaces inside the braces):
hadoop fs -cp /path1/{file1,file2,file3} /pathx/target
If you want to copy all the files then:
hadoop fs -cp /path1/* /pathx/target
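If you also need the overwrite behaviour from the original command, the -f flag still applies with the corrected argument order (sources first, one destination last):
hadoop fs -cp -f /path1/file1 /path2/file2 /path3/file3 /pathx/target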
I would like to know how the getmerge command works at the OS/HDFS level. Does it copy each and every byte/block from one file to another file, or is it just a simple file descriptor change? How costly an operation is it?
getmerge
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
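For what it's worth, newer Hadoop releases expose addnl as the -nl flag, so the newline-per-file behaviour can also be requested like this (assuming Hadoop 2.x or later):
hadoop fs -getmerge -nl <src> <localdst>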
So, to answer your question,
Does it copy each and every byte/block from one file to another file
Yes, and no. It will find every HDFS block containing the files in the given source directory and concatenate them together into a single file on your local filesystem.
a simple file descriptor change
Not sure what you mean by that. getmerge doesn't change any file descriptors; it is just reading data from HDFS to your local filesystem.
How costly an operation is it?
Expect it to be as costly as manually cat-ing all the files in an HDFS directory. The same operation for
hadoop fs -getmerge /tmp/ /home/user/myfile
Could be achieved by doing
hadoop fs -cat /tmp/* > /home/user/myfile
The costly part is fetching the many file pointers and transferring those records over the network to your local disk.
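A quick sanity check that the whole payload really does travel over the wire is to compare the size reported by HDFS with the size of the merged local file (reusing the example paths above):
hadoop fs -du -s /tmp/
ls -l /home/user/myfile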
I am a beginner in Hadoop. I have two doubts:
1) How do I access files stored in HDFS? Is it the same as using a FileReader from java.io and giving the local path, or is it something else?
2) I have created a folder where I copied the file to be stored in HDFS and the jar file of the MapReduce program. When I run the following command from any directory
${HADOOP_HOME}/bin/hadoop dfs -ls
it just shows me all the files in the current directory. So does that mean all the files got added to HDFS without me explicitly adding them?
Yes, it's pretty much the same. Read this post to read files from HDFS.
You should keep in mind that HDFS is different from your local file system. With hadoop dfs you access HDFS, not the local file system. So hadoop dfs -ls /path/in/HDFS shows you the contents of the /path/in/HDFS directory in HDFS, not the local one. That's why the output is the same no matter where you run the command from.
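For example, with no path argument -ls falls back to your HDFS home directory (typically /user/<your-username>), which is exactly why the listing does not depend on your local working directory; the username below is only a placeholder:
hadoop dfs -ls     # lists /user/<your-username> in HDFS
hadoop dfs -ls /   # lists the HDFS root
ls                 # lists the local current directory, a different file system entirely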
If you want to "upload" / "download" files to/from HDFS you should use the commands:
hadoop dfs -copyFromLocal /local/path /path/in/HDFS and
hadoop dfs -copyToLocal /path/in/HDFS /local/path, respectively.
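A minimal round trip with made-up names, just to show both directions (input.txt and the paths are hypothetical):
hadoop dfs -copyFromLocal input.txt /user/<your-username>/input.txt
hadoop dfs -ls /user/<your-username>
hadoop dfs -copyToLocal /user/<your-username>/input.txt ./input_copy.txt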