Is it possible to use DistCp to copy only files that match a certain pattern?
For example, for /foo I only want the *.log files.
I realize this is an old thread, but I was interested in the answer to this question myself - and dk89 also asked again in 2013. So here we go:
distcp does not support wildcards. The closest you can do is to:
Find the files you want to copy (sources), filter them using grep, format them for HDFS using awk, and write the result to an "input-files" list:
hadoop dfs -lsr hdfs://localhost:9000/path/to/source/dir/ \
  | grep -e 'webapp.log.3.' | awk '{print "hdfs://localhost:9000" $8}' > input-files.txt
Put the input-files list into hdfs
hadoop dfs -put input-files.txt .
Create the target dir
hadoop dfs -mkdir hdfs://localhost:9000/path/to/target/
Run distcp using the input-files list and specifying the target hdfs dir:
hadoop distcp -i -f input-files.txt hdfs://localhost:9000/path/to/target/
DistCp is in fact just a regular map-reduce job: you can use the same globbing syntax as you would for the input of a regular map-reduce job. Generally, you can just use foo/*.log and that should suffice. You can experiment with the hadoop fs -ls command here: if a glob works with fs -ls, it will work with DistCp (well, almost, but the differences are too subtle to mention).
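A minimal sketch of that check, assuming globbing works for your version as this answer describes (the hostnames and target path here are placeholders, not from the original question):

hadoop fs -ls hdfs://namenode:9000/foo/*.log                                   # verify the glob matches the files you expect
hadoop distcp hdfs://namenode:9000/foo/*.log hdfs://backup-namenode:9000/target/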
I have a file that has a square bracket in its name. This file needs to be uploaded to Hadoop via hadoop fs -put. I am using MapR 6.
The following variants all fail with put: unexpected URISyntaxException
hadoop fs -put aaa[bbb.txt /destination
hadoop fs -put aaa\[bbb.txt /destination
hadoop fs -put "aaa[bbb.txt" /destination
hadoop fs -put "aaa\[bbb.txt" /destination
Did you try "aaa%5Bbbb.txt"?
Hadoop commands such as hadoop fs -put generally do a bad job with escaping names.
That is the bad news.
The good news is that with MapR, you can avoid all of that and simply copy the file to a local mount of the MapR file system using standard Linux commands like cp. There is no need to "upload" anything because MapR feels and acts just like an ordinary file system. You can get the required mount using NFS or the POSIX drivers.
The big benefit of this is that you get the maturity of the Linux commands' implementations. That is, those commands (and the shell) handle quoting correctly, so you can get the result you want relatively trivially. Just use single quotes and be done with it.
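A minimal sketch of that approach, assuming the cluster is NFS-mounted at /mapr/my.cluster.com (the mount point and destination directory are placeholders):

cp 'aaa[bbb.txt' /mapr/my.cluster.com/destination/    # single quotes keep the bracket literal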
I have an 8.8G file on the hadoop cluster that I'm trying to extract certain lines for testing purpose.
Seeing that Apache Hadoop 2.6.0 has no split command, how am I able to do it without having to download the file?
If the file was on a linux server I would've used:
$ csplit filename %2015-07-17%
The previous command works as desired; is something close to that possible on Hadoop?
You could use a combination of unix and hdfs commands.
hadoop fs -cat filename.dat | head -250 > /redirect/filename
Or, if the last kilobyte of the file is sufficient, you could use this:
hadoop fs -tail filename.dat > /redirect/filename
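If you want the lines for a specific date rather than a fixed count (mirroring the csplit pattern in the question), a similar pipeline should work; this is just a sketch, reusing /redirect/filename from the answer above:

hadoop fs -cat filename.dat | grep '2015-07-17' > /redirect/filename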
I am learning Hadoop and I have never worked on Unix before, so I am facing a problem here. What I am doing is:
$ hadoop fs -mkdir -p /user/user_name/abcd
Now I am going to put a ready-made file named file.txt into HDFS:
$ hadoop fs -put file.txt /user/user_name/abcd
The file gets stored in HDFS, since it shows up when I run the -ls command.
Now I want to remove this file from HDFS. How should I do this? What command should I use?
If you run the command hadoop fs -usage you'll get a look at what commands the filesystem supports, and with hadoop fs -help you'll get a more in-depth description of them.
For removing files the command is simply -rm, with the -r flag added for recursively removing folders. Read the command descriptions and try them out.
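Using the paths from the question, that would look something like:

hadoop fs -rm /user/user_name/abcd/file.txt      # remove just the file
hadoop fs -rm -r /user/user_name/abcd            # or remove the whole directory recursively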
This is probably a question about stream processing, but I am not able to find an elegant solution using awk.
I am running an m/r job scheduled to run once a day, but there can be multiple HDFS directories on which it needs to run. For example, if 3 input directories were uploaded to HDFS for the day, then 3 m/r jobs, one for each directory, need to run.
So I need a solution where I can extract the filenames from the results of:
hdfs dfs -ls /user/xxx/17-03-15*
Then iterate over the filenames, launching one m/r job for each file.
Thanks
Browsing more on the issue, I found that Hadoop provides a configuration setting for this. Here are the details.
Also, I was just having a syntax issue, and this simple awk command did what I wanted:
files=`hdfs dfs -ls /user/hduser/17-03-15* | awk '{print $8}'`
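Iterating over those names and launching one job per directory could then look roughly like this; my-job.jar, com.example.MyJob, and the output path convention are placeholders, not anything from the original question:

for dir in $files; do
  # launch one job per input directory; output goes to a per-directory path
  hadoop jar my-job.jar com.example.MyJob "$dir" /user/hduser/output/$(basename "$dir")
done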
I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)
Here is the command I'm submitting (names have been changed):
/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/
It returns bash: /user/username/folder/outputdirectory/: No such file or directory
I also tried creating that directory and then running it again -- I still got the 'no such file or directory' error.
I have also tried using the -cp command to copy both into a new folder and -getmerge to combine them, but have had no luck with getmerge either.
The reason for doing this in hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of hadoop.
The error relates to you trying to redirect the standard output of the command back to HDFS. There are ways you can do this, using the hadoop fs -put command with the source argument being a hyphen:
bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv
-getmerge also outputs to the local file system, not HDFS
Unfortunately there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of Hadoop that is disabled by default and potentially buggy) without having to copy the files to one machine and then back into HDFS, whether you do that with:
- a custom map-reduce job with a single reducer and a custom mapper/reducer that retains the file ordering (remember each line will be sorted by the keys, so your key will need to be some combination of the input file name and line number, and the value will be the line itself), or
- the FsShell commands, depending on your network topology - i.e. does your client console have a good-speed connection to the datanodes? This is certainly the least effort on your part and will probably complete quicker than an MR job doing the same (as everything has to go to one machine anyway, so why not your local console?).
To concatenate all files in the folder to an output file:
hadoop fs -cat myfolder/* | hadoop fs -put - myfolder/output.txt
If you have multiple folders on HDFS and you want to concatenate the files in each of those folders, you can use a shell script to do this. (Note: this is not very efficient and can be slow.)
Syntax:
for i in `hadoop fs -ls <folder> | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/<outputfilename>; done
e.g.:
for i in `hadoop fs -ls my-job-folder | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/output.csv; done
Explanation:
So you basically loop over all the folders and cat each folder's contents into an output file on HDFS.
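As a small aside, the cut -d' ' -f19 field position depends on how a particular Hadoop version aligns its -ls output, so a slightly more defensive variant (just a sketch, assuming the path is the last field of each -ls line) would be:

for dir in `hadoop fs -ls my-job-folder | awk '/^d/ {print $NF}'`; do
  # keep only directory entries (lines starting with 'd') and take the last field as the path
  hadoop fs -cat "$dir"/* | hadoop fs -put - "$dir"/output.csv
done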