Overwrite destination with hadoop fs mv? - hadoop

Doing a quick test of the form
testfunc() {
hadoop fs -rm /test001.txt
hadoop fs -touchz /test001.txt
hadoop fs -setfattr -n trusted.testfield -v $(date +"%T") /test001.txt
hadoop fs -mv /test001.txt /tmp/.
hadoop fs -getfattr -d /tmp/test001.txt
}
testfunc()
testfunc()
resulting in output
... during second function call
mv: '/tmp/test001.txt': File exists
# file: /tmp/test001.txt
trusted.testfield="<old timestamp from first call>"
...
it seems like (unlike in linux) the hadoop fs mv command does not overwrite a destination file if already exists. Is there a way to force overwrite behavior (I suppose I could check and delete the destination each time, but something like hadoop mv -overwrite <source> <dest> would be more convenient for my purposes)?
** By the way if, I am interpreting the results incorrectly or the behavior just seems incorrect, let me know (as I had assumed that overwriting was the default behavior and am writing this question because I was surprised that it seemed not to be).

I think there is no straight option to move and overwrite files from one HDFS location to other although copying (cp command) has the option to force (using -f). From Apache Hadoop documentation (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), it is said that Hadoop is designed to use write-once-read-many model which limited overwriting.

Related

Hadoop error when outputting the grep results to a new file in a different directory

I'm trying to read the contents of a few files and using grep find the lines with the my search query and then output the results into a folder in another directory. I get an error "No such file or directory exists". I have created the folder structure and the text file.
hadoop fs -cat /Final_Dataset/c*.txt | grep 2015-01-* > /energydata/2015/01/01.txt
ERROR:
-bash: /energydata/2015/01/01.txt: No such file or directory
> /energydata/2015/01/01.txt means that the output is being redirected to a local file. hdfs fs -cat sends output to your local machine and at that point you're no longer operating within Hadoop. grep simply acts on a stream of data, it doesn't care (or know) where it came from.
You need to make sure that /energydata/2015/01/ exists locally before you run this command. You can create it with mkdir -p /energydata/2015/01/.
If you're looking to pull certain records from a file on HDFS and then re-write the new file to HDFS then I'd suggest not cat-ing the file and instead keeping the processing entirely on the cluster, by using something like Spark or Hive to transform data efficiently. Or failing that just do a hadoop dfs -put <local_path> /energydata/2015/01/01.txt.
The following CLI command worked
hadoop fs -cat /FinalDataset/c*.txt | grep 2015-01-* | hadoop fs -put - /energydata/2015/01/output.txt

Uploading file with square brackets in its name to Hadoop via hadoop fs -put

I have a file that has a square bracket in its name. This file needs to be uploaded to Hadoop via hadoop fs -put. I am using MapR 6.
The following variants lead to a put: unexpected URISyntaxException
hadoop fs -put aaa[bbb.txt /destination
hadoop fs -put aaa\[bbb.txt /destination
hadoop fs -put "aaa[bbb.txt" /destination
hadoop fs -put "aaa\[bbb.txt" /destination
did you tried "aaa%5Bbbb.txt" ?
Hadoop commands such as hadoop fs -put generally do a bad job with escaping names.
That is the bad news.
The good news is that with MapR, you can avoid all of that and simply copy the file to a local mount of the MapR file system using standard Linux commands like cp. There is no need to "upload" anything because MapR feels and acts just like an ordinary file system. You can get the required mount using NFS or the POSIX drivers.
The big benefit of this is that you get the benefit of the maturity of the implementations of the Linux commands. That is, those commands (and the shell) do quoting correctly and you can get the result you want relatively trivially. Just use single quotes and be done with it.

hdfs dfs -put with overwrite?

I am using
hdfs dfs -put myfile mypath
and for some files I get
put: 'myfile': File Exists
does that mean there is a file with the same name or does that mean the same exact file (size, content) is already there?
how can I specify an -overwrite option here?
Thanks!
put: 'myfile': File Exists
Means,the file named "myfile" already exists in hdfs. You cannot have multiple files of the same name in hdfs
You can overwrite it using hadoop fs -put -f /path_to_local /path_to_hdfs
You can overwrite your file in hdfs using -f command.For example
hadoop fs -put -f <localfile> <hdfsDir>
OR
hadoop fs -copyFromLocal -f <localfile> <hdfsDir>
It worked fine for me. However -f command won't work in case of get or copyToLocal command. check this question
A file with the same name exists at the location you're trying to write to.
You can overwrite by specifying the -f flag.
Just updates to this answer, in Hadoop 3.X the command a bit different
hdfs dfs -put -f /local/to/path hdfs://localhost:9870/users/XXX/folder/folder2

How do I remove a file from HDFS

I am learning Hadoop and I have never worked on Unix before . So, I am facing a problem here . What I am doing is:
$ hadoop fs -mkdir -p /user/user_name/abcd
now I am gonna put a ready made file with name file.txt in HDFS
$ hadoop fs -put file.txt /user/user_name/abcd
The file gets stored in hdfs since it shows up on running -ls command.
Now , I want to remove this file from HDFS . How should i do this ? What command should i use?
If you run the command hadoop fs -usage you'll get a look at what commands the filesystem supports and with hadoop fs -help you'll get a more in-depth description of them.
For removing files the commands is simply -rm with -rf specified for recursively removing folders. Read the command descriptions and try them out.

How can I concatenate two files in hadoop into one using Hadoop FS shell?

I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)
Here is the command I'm submitting (names have been changed):
**/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/**
It returns bash: /user/username/folder/outputdirectory/: No such file or directory
I also tried creating that directory and then running it again -- i still got the 'no such file or directory' error.
I have also tried using the -cp command to copy both into a new folder and -getmerge to combine them but have no luck with the getmerge either.
The reason for doing this in hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of hadoop.
The error relates to you trying to re-direct the standard output of the command back to HDFS. There are ways you can do this, using the hadoop fs -put command with the source argument being a hypen:
bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv
-getmerge also outputs to the local file system, not HDFS
Unforntunatley there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of hadoop, that is disabled by default and potentially buggy), without having to copy the files to one machine and then back into HDFS, whether you do that in
a custom map reduce job with a single reducer and a custom mapper reducer that retains the file ordering (remember each line will be sorted by the keys, so you key will need to be some combination of the input file name and line number, and the value will be the line itself)
via the FsShell commands, depending on your network topology - i.e. does your client console have a good speed connection to the datanodes? This certainly is the least effort on your part, and will probably complete quicker than a MR job to do the same (as everything has to go to one machine anyway, so why not your local console?)
To concatenate all files in the folder to an output file:
hadoop fs -cat myfolder/* | hadoop fs -put - myfolder/output.txt
If you have multiple folders on hdfs and you want to concatenate files in each of those folders, you can use a shell script to do this. (note: this is not very effective and can be slow)
Syntax :
for i in `hadoop fs -ls <folder>| cut -d' ' -f19` ;do `hadoop fs -cat $i/* | suy hadoop fs -put - $i/<outputfilename>`; done
eg:
for i in `hadoop fs -ls my-job-folder | cut -d' ' -f19` ;do `hadoop fs -cat $i/* |hadoop fs -put - $i/output.csv`; done
Explanation:
So you basically loop over all the files and cat each of the folders contents into an output file on the hdfs.

Resources