Hadoop: read files with specific name patterns

This may sound very basic, but I have a folder in HDFS with 3 kinds of files,
e.g.:
access-02171990
s3.Log
catalina.out
I want my map/reduce job to read only the files that begin with access-. How do I do that programmatically, or by specifying the input directory path?
Please help.

You can set the input path as a glob:
FileInputFormat.addInputPath(jobConf, new Path("/your/path/access*"));
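If you would rather do it programmatically than in the path, you can also attach a PathFilter to the input format. A minimal sketch using the new mapreduce API (the class names and driver skeleton below are illustrative, not from the original answer):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class AccessLogsOnlyDriver {

    // Accept only files whose names begin with "access-"
    public static class AccessFilter implements PathFilter {
        @Override
        public boolean accept(Path path) {
            return path.getName().startsWith("access-");
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(AccessLogsOnlyDriver.class);

        // Point the job at the whole folder...
        FileInputFormat.addInputPath(job, new Path("/your/path"));
        // ...and let the filter keep only the access-* files.
        FileInputFormat.setInputPathFilter(job, AccessFilter.class);

        // Mapper, reducer and output settings omitted for brevity.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}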

Related

Multiple source files for s3distcp

Is there a way to copy a list of files from S3 to HDFS, instead of a complete folder, using s3distcp? This is for cases where srcPattern cannot work.
I have multiple files in an S3 folder, all with different names. I want to copy only specific files to an HDFS directory. I did not find any way to specify multiple source file paths to s3distcp.
The workaround I am currently using is to list all the file names in srcPattern:
hadoop jar s3distcp.jar
--src s3n://bucket/src_folder/
--dest hdfs:///test/output/
--srcPattern '.*somefile.*|.*anotherone.*'
Can this work when the number of files is very large, say around 10,000?
hadoop distcp should solve your problem.
You can use distcp to copy data from S3 to HDFS; it also supports wildcards, and you can provide multiple source paths in the command.
http://hadoop.apache.org/docs/r1.2.1/distcp.html
Go through the usage section on that page.
Example:
Consider that you have the following files in an S3 bucket (test-bucket) inside the test1 folder:
abc.txt
abd.txt
defg.txt
And inside the test2 folder you have:
hijk.txt
hjikl.txt
xyz.txt
And your HDFS path is hdfs://localhost.localdomain:9000/user/test/
Then the distcp command for this particular pattern is as follows:
hadoop distcp s3n://test-bucket/test1/ab*.txt \
              s3n://test-bucket/test2/hi*.txt \
              hdfs://localhost.localdomain:9000/user/test/
Yes, you can. Create a manifest file with all the files you need and use the --copyFromManifest option, as mentioned here.

Reading files from multiple directories in Logstash?

I read my log files (cron_log, auth_log, mail_log, etc) using this config:
file{
path => '/path/to/log/file/*_log'
}
So I read my log files and check:
if [path] =~ "cron"  -> match
if [path] =~ "auth"  -> match
Now I have directories like Server1, Server2, Server3... In Server1 there are subdirectories such as authlog and cronlog. Inside authlog there are date-wise subdirectories (like 2014.05.26, 2014.05.27), which finally contain the log file for the day that I have to parse.
So far I have had one config file that reads files using *_log; I run that config file and all log files matching /path/to/log/file/*_log are parsed.
Now I have to read from many directories (as explained above).
Will I have to write a separate config file for each directory?
What's the best way to achieve this using Logstash?
Ruby globs interpret ** as including all subdirectories.
So, for example, you could give the file input a path such as:
/path/to/date/folders/**/*_log

Merging MapReduce output

I have two MapReduce jobs which produce files in two separate directories which look like so:
Directory output1:
------------------
/output/20140102-r-00000.txt
/output/20140102-r-00000.txt
/output/20140103-r-00000.txt
/output/20140104-r-00000.txt
Directory output2:
------------------
/output-update/20140102-r-00000.txt
I want to merge these two directories into a new directory /output-complete/, where the 20140102-r-00000.txt from the update replaces the original file in the /output directory and the "-r-0000x" suffix is removed from every file name. The two original directories will then be empty, and the resulting directory should look as follows:
Directory output3:
-------------------
/output-complete/20140102.txt
/output-complete/20140102.txt
/output-complete/20140103.txt
/output-complete/20140104.txt
What is the best way to do this? Can I use only HDFS shell commands? Do I need to create a Java program to traverse both directories and do the logic?
You can use Pig:
get_data = LOAD '/output*/20140102*.txt' USING PigStorage();
STORE get_data INTO '/output-complete/20140102.txt';
Or with an HDFS command:
hadoop fs -cat '/output*/20140102*.txt' > output-complete/20140102.txt
If single quotes do not work, try double quotes.
You can use the hadoop fs -getmerge command for merging HDFS files.
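If you do end up writing a Java program, the plain FileSystem API is enough to walk both directories, strip the suffix, and let /output-update win on name collisions because it is processed last. A rough sketch (assumes Hadoop 2.x APIs; the rename rule is taken from the question, not a tested solution):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeOutputs {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dest = new Path("/output-complete");
        fs.mkdirs(dest);

        // Copy the base output first, then the updates, so updated files overwrite.
        for (Path dir : new Path[] { new Path("/output"), new Path("/output-update") }) {
            for (FileStatus status : fs.listStatus(dir)) {
                if (!status.isFile()) {
                    continue;
                }
                // 20140102-r-00000.txt -> 20140102.txt
                String name = status.getPath().getName().replaceAll("-[rm]-\\d+", "");
                FSDataInputStream in = fs.open(status.getPath());
                FSDataOutputStream out = fs.create(new Path(dest, name), true);
                IOUtils.copyBytes(in, out, conf, true); // closes both streams
                fs.delete(status.getPath(), false);     // leaves the source directories empty
            }
        }
    }
}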

Hadoop: How to generate custom reduce output file name?

Now, I use MultipleOutputs.
I would like to remove the suffix string "-00001" from the reducer's output filename, such as "xxxx-[r/m]-00001".
Any ideas?
Thanks.
From the Hadoop javadoc for the write() method of MultipleOutputs:
Output path is a unique file generated for the namedOutput. For example, {namedOutput}-(m|r)-{part-number}
So you need to rename or merge these files on HDFS.
I think you can do it in the job driver: when your job completes, change the file names. You could also do it via terminal commands.
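A minimal sketch of that rename-in-the-driver idea (the output directory and the suffix pattern are assumptions; adjust them to your job):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputRenamer {
    // Call this from the driver after job.waitForCompletion(true) returns true.
    public static void stripPartSuffixes(Configuration conf, Path outputDir) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(outputDir)) {
            String name = status.getPath().getName();
            // e.g. "xxxx-r-00001" -> "xxxx"
            String stripped = name.replaceAll("-[mr]-\\d{5}$", "");
            if (!stripped.equals(name)) {
                fs.rename(status.getPath(), new Path(outputDir, stripped));
            }
        }
    }
}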

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to be working fine. Out of curiosity, I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
Each of these contains part-00000 files. I would like to read the contents of these files for better understanding; the cat command doesn't seem to work, it just prints garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-00000 files using Hadoop's filesystem -text option. Just get into the Hadoop directory and type the following:
bin/hadoop dfs -text /Path-to-part-file/part-m-00000
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to add the relevant jar to the HADOOP_CLASSPATH variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the jar containing the corresponding class to the HADOOP_CLASSPATH variable:
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
$MAHOUT_HOME/bin/mahout seqdumper \
    -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 \
    -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file
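If you would rather inspect such files from your own code than through the command-line tools, Hadoop's SequenceFile.Reader works as well; as with the -text approach above, the Mahout jars (for value classes such as VectorWritable) must be on the classpath. A small sketch, with the part file path passed as an argument:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFilePeek {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. .../tfidf-vectors/part-r-00000

        SequenceFile.Reader reader =
                new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
        try {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}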
