hadoop file splitting using KeyFieldBasedPartitioner - hadoop

I have a big file that is formatted as follows:
sample name \t index \t score
I'm trying to split this file based on sample name using Hadoop Streaming.
I know ahead of time how many samples there are, so I can specify how many reducers I need.
This post is doing something very similar, so I know that this is possible.
I tried the following script to split this file into 16 files (there are 16 samples):
hadoop jar $STREAMING \
-D mapred.text.key.partitioner.options=-k1,1 \
-D stream.num.map.output.key.fields=2 \
-D mapred.reduce.tasks=16 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper cat \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-input input_dir/*part* -output output_dir
This somewhat works: some of the output files contain exactly one sample name. However, most of the part* files are empty, and some contain multiple sample names.
Is there a better way to make sure that every reducer gets only one sample name?
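The empty and doubled-up part files are expected behavior rather than a bug: KeyFieldBasedPartitioner hashes the first key field and takes it modulo mapred.reduce.tasks, so with 16 distinct sample names and 16 reducers some hash buckets collide while others stay empty. A rough shell sketch of the same effect (cksum stands in for Java's String.hashCode(); the sample names are made up):

```shell
# Model of hash partitioning: bucket = hash(first key field) mod numReduceTasks.
# With 16 keys and 16 buckets, collisions (and thus empty reducers) are likely.
reducers=16
buckets=""
for i in $(seq 1 16); do
  h=$(printf 'sample%s' "$i" | cksum | cut -d' ' -f1)   # stand-in hash
  buckets="$buckets $(( h % reducers ))"
done
distinct=$(printf '%s\n' $buckets | sort -u | wc -l)
echo "16 sample names landed in $distinct of $reducers reducers"
```

Guaranteeing exactly one sample per output file therefore needs something beyond hash partitioning, which is why the custom OutputFormat route mentioned in the answer is the cleaner fix.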

FYI, there is actually a much cleaner way to split up files, using a custom OutputFormat.
This link describes how to do it really well. I ended up tailoring this other link for my specific application. Altogether, it's only a few extra lines of Java.

Related

BWA fail to locate the index files

I'm currently working on trying to analyze a dataset. I'm new to the field of bioinformatics and was trying to use BWA tools, however, as soon as I reach bwa mem, I keep running into the same error:
input --> mirues-macbook:sra ipmiruek$ bwa mem -t 8 Homo_sapiens.GRCh38.dna.chromosome.17.fa ERR3841737/ERR3841737_trimmed.fq.gz > ERR3841737/ERR3841737_mapped.sam
output --> [E::bwa_idx_load_from_disk] fail to locate the index files
I've already indexed the reference chromosome as such:
bwa index Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
Is there anything I could do to fix this problem? Thank you.
I tried changing the dataset I was using along with the corresponding reference chromosome, but it still yielded the same result. Is this an issue with the code or with the dataset I'm working with?
It looks like you indexed a gzip-compressed FASTA file, but are supplying an index base (idxbase) without the .gz extension. What you want is:
$ bwa mem \
-t 8 \
Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz \
ERR3841737/ERR3841737_trimmed.fq.gz \
> ERR3841737/ERR3841737_mapped.sam
Alternatively, gunzip the reference FASTA file and index it. For example:
$ gunzip Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
$ bwa index Homo_sapiens.GRCh38.dna.chromosome.17.fa
Note that BWA packs the reference sequences (into the .pac file), so you don't even need the FASTA file to run BWA MEM after it's been indexed.
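As a quick sanity check: bwa index writes five companion files (.amb, .ann, .bwt, .pac, .sa) next to the reference, and bwa mem looks all of them up under the exact idxbase string you pass. A sketch that verifies they are present (the reference name here is the one from the question; substitute your own):

```shell
ref=Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz   # must match what `bwa index` was run on
missing=0
for ext in amb ann bwt pac sa; do
  if [ ! -e "$ref.$ext" ]; then
    echo "missing $ref.$ext"
    missing=1
  fi
done
if [ "$missing" -eq 0 ]; then echo "all index files present for $ref"; fi
```

If any of the five are reported missing, bwa mem will fail with exactly the "fail to locate the index files" error above.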

How to replace a value of variable in one file from text file using bash script?

I have a Makefile in which I want to change/replace the value of a variable at runtime with a bash script.
Makefile content:-
SUBDIRS = common alarm coders crypto communication conup database \
dynamicProtocols dynamethods dynamic TTStorage tables \
jobManager logManager processManager \
collection processing distribution mediationManager adaptations tools performanceMonitor \
cli
Now I want this SUBDIRS value to be replaced with the content of my text file.
Text file content:-
database
common
jobManager
coders
process
This text file content may vary from 1-20 words.
Now, as suggested in another thread, we used the solution below for a single word:
sed -r 's/(SUBDIRS = ).*/\1protocols/' Makefile
But this only replaces the first line with 'protocols'. The generated output is:
SUBDIRS = protocols
dynamicProtocols dynamethods dynamic TTStorage tables \
jobManager logManager processManager \
collection processing distribution mediationManager adaptations tools performanceMonitor \
cli
While desired output is:-
SUBDIRS = protocols
Now, we want to read all contents of the text file and assign them to SUBDIRS, as shown below:
SUBDIRS = database common jobManager coders process
Please suggest.
sed approach:
sed -z -e 's/\n//g' -e "s/\(SUBDIRS = \).*/\1$(tr '\n' ' ' < words)\n/;" Makefile
words is the file containing the 1-20 words
-z - treat the input as a set of lines, each terminated by a zero byte
s/\n//g - removing newlines in the Makefile
$(tr '\n' ' ' < words) - translating newlines into spaces in the words file
The output:
SUBDIRS = database common jobManager coders process
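One caveat with the sed -z approach: the first s/\n//g joins the entire Makefile into one line, so if the file contains anything after the SUBDIRS assignment, the .* swallows that too. An awk sketch that rewrites only the SUBDIRS block and leaves other rules intact (the demo files here mirror the question; adapt the names to your setup):

```shell
cd "$(mktemp -d)"    # demo in a scratch directory
printf 'database\ncommon\njobManager\ncoders\nprocess\n' > words
printf 'SUBDIRS = common alarm \\\n\tcoders crypto\nall: build\n' > Makefile

new=$(paste -sd' ' words)            # join the word list with spaces
awk -v new="$new" '
  cont      { if (/\\$/) next;       # still inside the old continuation block
              cont = 0; next }       # last continuation line: drop it too
  /^SUBDIRS[ \t]*=/ {
              print "SUBDIRS = " new
              if (/\\$/) cont = 1    # old assignment spans more lines
              next }
  { print }
' Makefile > Makefile.new
cat Makefile.new
```

Here the `all: build` rule after the assignment survives untouched, which the one-line sed version cannot guarantee.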

How to do a secondary sort on filenames with numbers in hadoop streaming?

I'm trying to sort file names such as
cat1.pdf, cat2.pdf, ... cat10.pdf ...
I'm utilizing a sort right now with the following parameters:
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator
-D stream.num.map.output.key.fields=2
-D mapreduce.partition.keypartitioner.options="-k1,1"
-D mapreduce.partition.keycomparator.options="-k1,1 -k2,2 -V"
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
The key-value pairs are tab-separated, with the file name as the value and a string as the key. The problem is that my current sort secondary-sorts the file names so that I get
cat1.pdf, cat10.pdf, cat2.pdf, cat3.pdf, cat30.pdf ...
How can I get it such that the files are sorted like this:
cat1.pdf, cat2.pdf, cat3.pdf ... cat10.pdf,cat11.pdf...
I'm using hadoop streaming 2.7.1
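As far as I know, KeyFieldBasedComparator only understands sort-style -n and -r flags, so I would not rely on a version-sort -V being honored. A workaround that sidesteps the comparator entirely is to have the mapper zero-pad the numeric part of the filename, so plain lexicographic order equals numeric order. A sed sketch of the padding transform (assumes names like cat<digits>.pdf with at most 6 digits):

```shell
# Zero-pad the digit run before ".pdf" to 6 places:
# cat1.pdf -> cat000001.pdf, cat10.pdf -> cat000010.pdf
pad() { sed -E 's/([0-9]+)\.pdf$/00000\1.pdf/; s/0*([0-9]{6})\.pdf$/\1.pdf/'; }
printf 'cat1.pdf\ncat10.pdf\ncat2.pdf\ncat30.pdf\n' | pad | sort
```

In a streaming job you would apply the same substitution to the filename field in the mapper (and strip the padding again in the reducer if the original names must survive).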

using split command in shell how to know the number of files generated

I am using the split command on a large file to generate little files which are put in a folder; my problem is that the folder also contains files other than those from my split.
I would like to know if there is a way to count how many files were generated by my split alone, not the total number of files in the folder.
My command is split a 2 d. Is there any option I can add to this command to find out?
I know ls -Al | wc -l will give me the number of files in the folder, but that doesn't interest me.
The simplest solution here is to split into a fresh directory.
Assuming that's not possible, and you aren't worried about other processes operating on the directory in question, you can just count the files before and after. Something like this:
$ before=(*)
$ split a 2 d
$ after=(*)
$ echo "Split files: $(( ${#after[@]} - ${#before[@]} ))"
If the other files in the directory can't have the same format as the split files (and presumably they can't, or split would fail or overwrite them), then you could use an appropriate glob to get just the files that match the pattern. Something like splitfiles=(d??).
Failing that, you could see whether the --verbose option to split lets you use split_count=$(split --verbose a 2 d | wc -l) or similar.
To be different, I will count the lines with grep, utilizing the --verbose option:
split --verbose other_options file|grep -c ""
Example:
$ split --verbose -b 2 file|grep -c ""
60
# yeah, my file is pretty small, splitting on 2 bytes to produce numerous files
You can also give split an explicit PREFIX operand to name the generated files (and -a to control the suffix length), which makes them easy to identify and count.
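Building on the prefix idea: giving split a unique prefix makes its output trivially countable with a glob, no matter what else lives in the directory (a demo sketch; myprefix_ is an arbitrary name):

```shell
cd "$(mktemp -d)"                 # scratch dir with pre-existing clutter
touch unrelated.txt
printf '%s\n' one two three four > a
split -l 2 a myprefix_            # 2 lines per chunk -> 2 output files
set -- myprefix_*                 # glob matches only the split output
echo "split generated $# files"
```

The unrelated.txt file never enters the count, because only split's output matches the myprefix_* glob.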

Doubts regarding the below MakeFile

I am new to nmake. I came across this Makefile for nmake; it is a legacy part of an application that I need to support. It works fine, it's just that I had problems understanding the syntax. I tried my best on Google but still couldn't understand some intricacies.
I am curious about the concept of using two colons in the :INSTALL: rule below. I'm also not able to follow the :B:S variable editing.
JAVA_FILES = \
Abc.java \
Def.java \
Ghi.java
/***********Define targets **********************/
.all : $(JAVA_FILES:B:S=.class)
:INSTALL: .all
$(JAVA_FILES:B:S=.class) : $(JAVA_FILES) .JOINT
for node in `echo $VPATH | tr ':' ' '`
do
nodecp=${nodecp}:${node}/gfp_scom:${node}/gfp_scom/lib/GfpScomWsdl.jar:${node}/gfp_scom/lib/jmxtools.jar
done
$(JAVAC) -cp ${nodecp}:$(CLASSPATH) -d ../../ $(*:M!=\.class)
