I read this question about loading directories in Pig from a matched pattern, but I want to run a job that deletes in the same way. I have time-stamped directories, e.g. /mydir/02-03-01, /mydir/02-03-02, /mydir/02-03-03, etc., and want to delete, say, 02-03-01 through 02-03-02. I tried
rmf /mydir/02-03-{01,02}/
both with and without quotes, to no effect. Any ideas?
The one below works for me. It should be the first command in the Pig script.
fs -rmr -skipTrash /user/root/mydir/02-03-{01,02};
-rmr is deprecated; you can use this instead:
fs -rm -r -skipTrash /user/root/mydir/02-03-{01,02,03};
Does the Hadoop filesystem shell support moving an empty directory?
Assume that the directory /user/abc, used below, is empty.
hadoop fs -mv /user/abc/* /user/xyz/*
When I execute the above command, it gives me the error
'/user/abc/*' does not exists.
However, if I put some data inside /user/abc, it executes successfully.
Does anyone know how to handle the case of an empty directory?
Is there any alternative way to execute the above command without getting an error?
hadoop fs -mv /user/abc/* /user/xyz
The destination path doesn't need the /* suffix.
I think you want to rename the directory. You can also use this:
hadoop fs -mv /user/abc /user/xyz
Because your xyz directory is empty, you don't get an error, but if your xyz directory already has many files, you will get an error as well.
This answer should be correct, I believe.
hadoop fs -mv /user/abc /user/xyz
'*' is a wildcard, so it looks for any file inside the folder. When nothing is found, it returns the error.
As per the documentation for the command: when you move a file, all links to other files remain intact, except when you move it to a different file system.
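If the goal is simply to avoid the error when /user/abc happens to be empty, a minimal shell-level guard (a sketch, not part of the original answers) is to run the move only when the wildcard actually matches something in HDFS:
# only move when /user/abc contains at least one entry
hadoop fs -ls '/user/abc/*' > /dev/null 2>&1 && hadoop fs -mv '/user/abc/*' /user/xyz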
Doing a quick test of the form
testfunc() {
hadoop fs -rm /test001.txt
hadoop fs -touchz /test001.txt
hadoop fs -setfattr -n trusted.testfield -v $(date +"%T") /test001.txt
hadoop fs -mv /test001.txt /tmp/.
hadoop fs -getfattr -d /tmp/test001.txt
}
testfunc   # first call
testfunc   # second call: the mv fails because the destination already exists
resulting in output
... during second function call
mv: '/tmp/test001.txt': File exists
# file: /tmp/test001.txt
trusted.testfield="<old timestamp from first call>"
...
it seems like (unlike in Linux) the hadoop fs -mv command does not overwrite a destination file if it already exists. Is there a way to force overwrite behavior? (I suppose I could check for and delete the destination each time, but something like hadoop mv -overwrite <source> <dest> would be more convenient for my purposes.)
By the way, if I am interpreting the results incorrectly or the behavior just seems wrong, let me know (I had assumed that overwriting was the default behavior and am writing this question because I was surprised that it seemed not to be).
I think there is no direct option to move and overwrite files from one HDFS location to another, although copying (the cp command) does have an option to force it (using -f). According to the Apache Hadoop documentation (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), Hadoop is designed around a write-once-read-many model, which limits overwriting.
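As a workaround, deleting the destination before the move should give the overwrite effect; a sketch using the paths from the test above (-f keeps -rm quiet when the destination does not exist yet):
hadoop fs -rm -f /tmp/test001.txt
hadoop fs -mv /test001.txt /tmp/test001.txt
Alternatively, since copying does support forcing, a copy followed by a delete of the source also works, at the cost of writing the data twice:
hadoop fs -cp -f /test001.txt /tmp/test001.txt
hadoop fs -rm /test001.txt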
I am learning Hadoop and I have never worked on Unix before, so I am facing a problem here. What I am doing is:
$ hadoop fs -mkdir -p /user/user_name/abcd
Now I am going to put a ready-made file named file.txt into HDFS:
$ hadoop fs -put file.txt /user/user_name/abcd
The file gets stored in HDFS, since it shows up when running the -ls command.
Now I want to remove this file from HDFS. How should I do this? What command should I use?
If you run the command hadoop fs -usage you'll get a list of the commands the filesystem shell supports, and with hadoop fs -help you'll get a more in-depth description of them.
For removing files the command is simply -rm, with -r specified for recursively removing folders. Read the command descriptions and try them out.
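For example, with the file from the question above (a sketch; substitute your own user name in the path):
hadoop fs -rm /user/user_name/abcd/file.txt
hadoop fs -rm -r /user/user_name/abcd
hadoop fs -help rm
The first command removes just the file, the second removes the whole directory recursively, and the last prints the full description of the rm options.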
I have created a folder to drop the result file from a Pig process using the STORE command. It works the first time, but the second time it complains that the folder already exists. What is the best practice for this situation? Documentation is sparse on this topic.
My next step will be to rename the folder to the original file name, to reduce the impact of this. Any thoughts?
You can execute fs commands from within Pig, and should be able to delete the directory by issuing an fs -rmr command before running the STORE command:
fs -rmr dir
STORE A into 'dir' using PigStorage();
The only subtlety is that the fs command doesn't expect quotes around the directory name, whereas the STORE command does.
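Since -rmr is deprecated in newer releases (as noted earlier in this thread), the equivalent should be the following sketch, where -f avoids a failure on the very first run when the directory does not exist yet:
fs -rm -r -f dir
followed by the same STORE statement as above.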
Is it possible to use DistCp to copy only files that match a certain pattern?
For example, for /foo I only want the *.log files.
I realize this is an old thread, but I was interested in the answer to this question myself, and dk89 also asked it again in 2013. So here we go:
distcp does not support wildcards. The closest you can get is to:
Find the files you want to copy (sources), filter them using grep, format them for HDFS using awk, and output the result to an "input-files" list:
hadoop dfs -lsr hdfs://localhost:9000/path/to/source/dir/ \
  | grep -e webapp.log.3. | awk '{print "hdfs://localhost:9000" $8}' > input-files.txt
Put the input-files list into hdfs
hadoop dfs -put input-files.txt .
Create the target dir
hadoop dfs -mkdir hdfs://localhost:9000/path/to/target/
Run distcp using the input-files list and specifying the target hdfs dir:
hadoop distcp -i -f input-files.txt hdfs://localhost:9000/path/to/target/
DistCp is in fact just a regular MapReduce job: you can use the same globbing syntax as you would for the input of a regular MapReduce job. Generally, you can just use foo/*.log and that should suffice. You can experiment with the hadoop fs -ls statement here: if globbing works with fs -ls, then it will work with DistCp (well, almost, but the differences are too subtle to mention).
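A quick way to check (a sketch, reusing the namenode URI from the other answer; quoting the glob lets Hadoop do the globbing rather than the local shell):
hadoop fs -ls 'hdfs://localhost:9000/foo/*.log'
hadoop distcp 'hdfs://localhost:9000/foo/*.log' hdfs://localhost:9000/path/to/target/
If the -ls lists the files you expect, the same glob should behave the same way as a DistCp source.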