Running Word Count or Pig Script on a Directory to produce result in separate files - hadoop

I am new to Hadoop/Pig.
I have a directory which has several files. Now I need to run a word count on those. I can use the Hadoop sample example wordcount and run it on the directory to get the output, but the output will be in a single file. What should I do if I want the output of each file should be in a different file?
I can use Pig too. And give the directory as input to pig. However how can I read the file names inside the Directory and then give it to the LOAD?
What I meant is:
Say I have a directory Test which has 5 files test1, test2, test3, test4, test5. Now I want the word count of each file separately in a separate file. I know I can provide individual names and do it, but that would take a lot of time.
Is it possible that I can read filenames from the directory and provide them as input to LOAD of pig?

If you're using Pig version 0.10.0 or later, you can take advantage of a combination of source tagging and MultiStorage to keep track of the files.
For example, if you had an input directory pigin with files and content as the following:
pigin
|-test1 => "hello"
|-test2 => "world"
|-test3 => "Apache"
|-test4 => "Hadoop"
|-test5 => "Pig"
The following script will read each script and write the contents of each file to a different directory.
%declare inputPath 'pigin'
%declare outputPath 'pigout'
-- Define MultiStorage to write output to different directories based on the
-- first element in the tuple
define MultiStorage org.apache.pig.piggybank.storage.MultiStorage('$outputPath','0');
-- Load the input files, prepending each tuple with the file name
A = load '$inputPath' using PigStorage(',', '-tagsource');
-- Write output to different directories
store A into '$outputPath' using MultiStorage();
The above script will create an output directory tree that looks like the following:
pigout
|-test1
| `-test1-0 => "test1 hello"
|-test2
| `-test2-0 => "test2 world"
|-test3
| `-test3-0 => "test3 Apache"
|-test4
| `-test4-0 => "test4 Hadoop"
|-test5
| `-test5-0 => "test5 Pig"
The -0 at the end of the filenames correspond to the reducers that produced the output. If you have more than one reducer, you may see more than one file per directory.

You could extend the PigStorage code to add the file name to the tuple, see Code Sample look for question "Q: I load data from a directory which contains different file. How do I find out where the data comes from?". For the output you could do similar extension of the PigStorage to write into different output files.

Related

How to loop over multiple folders to concatenate FastQ files?

I have received multiple fastq.gz files from Illumina Sequencing for 100 samples. But all the fastq.gz files for the respective samples are in separate folders according to the sample ID. Moreover, I have multiple (8-16) R1.fastq.gz and R2.fastq.gz files for one sample. So, I used the following code for concatenating all the R1.fastq.gz and R2.fastq.gz into a single R1.fastq.gz and R2.fastq.gz.
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz
So in the sequencing file, the structure is like the above in the code. For each sample, the string with V has different number then L with different number and then another string of digits before the _1 and _2. For each sample, the numbers keep changing.
My questing is, how can I create a loop that will go over all the folders at once taking the different file numbering of sequence files into consideration for concatenating the multiple fq.gz files and combine them into a single R1 and R2 file?
Surely, I cannot just concatenate one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1_fq.gz /....._525_2_fq.gz /....._526_1_fq.gz /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz /....._580_2_fq.gz /....._589_1_fq.gz /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz /....._690_2_fq.gz /....._645_1_fq.gz /....._645_2_fq.gz
Below I have attached a screenshot of the folder structure.
Folder structure
Based on the provided file structure, would you please try:
#!/bin/bash
for d in Raw2/C*/; do
(
cd "$d"
id=${d%/}; id=${id##*/} # extract ID from the directory name
cat V*_1.fq.gz > "${id}_R1.fq.gz"
cat V*_2.fq.gz > "${id}_R2.fq.gz"
)
done
The syntax for d in Raw2/C*/ loops over the subdirectories starting with C.
The parentheses make the inner commands executed in a subshell so we don't have to care about returning from cd "$d" (at the expense of small extra execution time).
The variable id is assigned to the ID extracted from the directory name.
cat V*_1.fq.gz, for example, will be expanded as V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory and are concatenated into ${id}_R1.fastq.gz. Same for ${id}_R2.fastq.gz.

Combine CSV files with condition

I need to combine all the csv files in some directory (.csv), provided that there are other files with the same name in this directory, but with different expansion (.csv.done).
If a csv file doesn't have .done in this extension then I don't need it for combine process.
What is the best way to do it using Bash ?
This approach is a solution to your problem. I see you've commented that it "didn't work", but whatever the reason is for it not working, it's likely simple to fix e.g. if you forgot to include key details, or failed to adapt it appropriately to suit your specific situation. If you need further help troubleshooting, add more info to your question.
The approach:
for f in *.csv.done
do
cat "${f%.*}" >> combined_file.csv
done
How it works:
In your example, you have 3 files named 1.csv 2.csv 3.csv and two 'done' files named 1.csv.done 2.csv.done.
This script begins by making a list of all files that end in .csv.done (two files: 1.csv.done 2.csv.done).
It then uses a parameter expansion, specifically ${parameter%word}, to 'shorten' the name of the two files in the list to .csv (instead of .csv.done).
Then it 'prints' the content of the two 'shortened' filenames (1.csv and 2.csv) into a 'combined' file.
It doesn't 'print' the content of 1.csv.done or 2.csv.done, or 3.csv, because these files weren't in the original 'list'.
If you run this script multiple times, it will keep adding the contents of files 1.csv and 2.csv to the 'combined' file (only run it once, or delete the 'combined' file before running it again)

wanted to use results of find command in custom script that i am building

I want to validate my XML's for well-formed ness, but some of my files are not having a single root (which is fine as per business req eg. <ri>...</ri><ri>..</ri> is valid xml in my context) , xmlwf can do this, but it flags out a file if it's not having single root, So wanted to build a custom script which internally uses xmlwf, my custom script should do below,
iterate through list of files passed as input (eg. sample.xml or s*.xml or *.xml)
for each file prepare a temporary file as <A>+contents of file+</A>
and call xmlwf on that temp file,
Can some one help on this?
You could add text to the beginning and end of the file using cat and bash, so that your file has a root added to it for validation purposes.
cat <(echo '<root>') sample.xml <(echo '</root>') | xmlwf
This way you don't need to write temporary files out.

Reading files from multiple directories in Logstash?

I read my log files (cron_log, auth_log, mail_log, etc) using this config:
file{
path => '/path/to/log/file/*_log'
}
So I read my log files and check:
if(path) ~= "cron" -----match--------
if(path) ~= "auth" -----match--------
Now I have a directories like: Server1 Server2 Server3......In Server 1 there are subdirectories: authlog cronlog.....Inside authlog there are subdirectories date wise (like 2014.05.26, 2014.05.27) which finally contain log file for the day, which I have to parse.
So presently I was having one config file which use to read files using *_log and I use to run that config file and all log files present in /path/to/log/file/*_log were parsed.
Now I have to read from many directories (as explained above).
Will I have to write separate config file for each directory??
What's the best way to achieve this using logstash??
Ruby globs interpret ** as including all subdirectories.
So, for example, you could give the file input a path such as:
/path/to/date/folders/**/*_log

Mahout - Naive Bayes

I tried deploying 20- news group example with mahout, it seems working fine. Out of curiosity I would like to dig deep into the model statistics,
for example: bayes-model directory contains the following sub directories,
trainer-tfIdf trainer-thetaNormalizer trainer-weights
which contains part-0000 files. I would like to read the contents of the file for better understanding, cat command doesnt seems to work, it prints some garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-0000 files using hadoop's filesystem -text option. Just get into the hadoop directory and type the following
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to add the HADOOP_CLASSPATH variable to your path. For example, if after running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the corresponding class to the HADOOP_CLASSPATH variable
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$: bin/mahout seqdumper -s
~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000
-o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file

Resources