Hadoop streaming sort

Could anyone help with this Hadoop streaming sort problem? Thanks in advance for any suggestions.
I am a newbie to Hadoop and need to implement a sort over a 500GB tab-delimited text file. The following is an example input; each line has 3 fields, like READA14 chr14 50989. I need to sort numerically by the 2nd and 3rd columns, but unless I set the number of reducers to 1, I never get a correctly ordered result.
Example Input:
READA14 chr14 50989
READB18 chr18 517043
READC22 chr22 88345
READD10 chr10 994183
READE19 chr19 232453
READF20 chr20 42912
READF9 chr9 767396
READG22 chr22 783469
READG16 chr16 522257
READH9 chr9 826357
READH16 chr16 555098
READH21 chr21 128309
READH4 chr4 719890
READH18 chr18 944551
READH22 chr22 530068
READH9 chr9 212247
READH11 chr11 574930
READH22 chr22 664833
READH2 chr2 908178
READH22 chr22 486178
READH7 chr7 533343
READH6 chr6 109022
READH15 chr15 316353
READH20 chr20 439938
READH21 chr21 731912
READH11 chr11 81162
READH2 chr2 670838
READH15 chr15 729549
READH3 chr3 196626
READH14 chr14 841104
My streaming sort command:
hadoop jar \
/home/hadoop-0.20.2-cdh3u5/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
-input /user/luoqin/projects/samsort/number \
-output /user/luoqin/projects/samsort/number_sort \
-mapper "cat" \
-reducer "sort -k 2.5 -n -k 3" \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-jobconf map.output.key.field.separator="\t" \
-jobconf num.key.fields.for.partition=1 \
-jobconf mapred.data.field.separator="\t" \
-jobconf map.output.key.value.fields.spec="2:0-" \
-jobconf reduce.output.key.value.fields.spec="2:0-" \
-jobconf mapred.reduce.tasks=50
The results were partitioned into 50 parts because mapred.reduce.tasks is set to 50. Viewing the results with the command below, however, the ordering is not correct unless the number of reduce tasks is set to 1:
hadoop fs -cat /user/projects/samsort/number_sort/*

By default Hadoop uses a hash partitioner: the key output from your mapper is hashed to determine which reducer that key is sent to. This hashing is what's causing your 'incorrect' results when you use more than a single reducer.
You should note that each output part is itself sorted; you now just need to interleave the different parts to get your single sorted output, as sketched below.
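A minimal sketch of that interleaving, assuming the output path from the question, that each part still holds the original three tab-separated fields, and that the chromosome number starts at character 4 of field 2 (as in chr14) - those offsets are an assumption, not taken from the question:
hadoop fs -get /user/luoqin/projects/samsort/number_sort local_parts
# sort -m merge-interleaves inputs that are each already sorted,
# so it stays cheap even for 50 large part files
sort -m -t$'\t' -k2.4,2n -k3,3n local_parts/part-* > number_sorted.txt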
You can solve this problem by implementing your own partitioner and sending key/value pairs to a reducer depending on the chrX value in your second field. You will, however, need to couple your partitioner implementation to the number of reducers; otherwise you'll still get results similar to what you have at the moment.
So if you know the domain or range of values of your second column (let's say chr0 to chr255), then you could run a 256-reducer job with a custom partitioner based upon the int value after the chr string.

Related

combine 2 text files with different numbers of rows into a csv file

I have 2 text files as below
A.txt (with 2 rows):
abc-1234
tik-3456
B.txt (with 4 rows)
123456
234567
987
12
I want to combine these 2 to get the below file in CSV format:
column-1 column-2
abc-1234 123456
tik-3456 234567
987
12
I am trying the below command; however, it does not achieve the above result.
paste -d "," A.txt B.txt > C.csv
It is giving below output:
abc-1234
,123456
tik-3456,234567
,987
,12
Can anyone please let me know, what I am missing here?
In Linux we have utilities that each do one thing well. So:
paste merges files
column with -t creates tables
The following:
paste -d',' /tmp/1 /tmp/2 | column -t -N 'column-1,column-2' -s',' -o' '
outputs the desired result.
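As for why the original paste command produced those broken lines: a common cause (an assumption here, since the raw bytes of A.txt aren't shown) is Windows-style CRLF line endings in one of the files, which make pasted lines appear split or overwritten. A quick check and fix:
# lines ending in ^M$ in the output indicate carriage returns
cat -A A.txt
# strip the carriage returns, then paste as before
tr -d '\r' < A.txt > A.unix.txt
paste -d "," A.unix.txt B.txt > C.csv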

Find overlapping ranges between different files

I have two files each with a column forming ranges.
File 1
23241-24234
10023-12300
75432-82324
File 2
16722-17234
92000-94532
23600-25000
I am looking for ranges that overlap with a certain % (e.g. 50%) between the two files
In the previous example only the following will be printed (50% overlap):
23241-24234 23600-25000
I can do this using Python, but was wondering if there is a quicker bash command that would do the same thing.
In Python, I would write something like this:
f1='''\
23241-24234
10023-12300
75432-82324'''
f2='''\
16722-17234
92000-94532
23600-25000'''
f1ranges=[tuple(map(int, l.split('-'))) for l in f1.splitlines()]
for l in f2.splitlines():
    b, e = map(int, l.split('-'))
    s2 = set(range(b, e))
    for r in f1ranges:
        s1 = set(range(*r))
        if len(s1 & s2) > len(s1) / 2:
            print(r, (b, e))
Prints:
(23241, 24234) (23600, 25000)
It is hard to beat that with pure shell utilities; awk would be the natural choice.
The Python method uses set intersection as a shortcut to determine the length of the overlapping interval. In awk you need to replicate that set-type functionality or, more simply, use arithmetic comparisons.
Here is an awk version of the same logic; the overlap length is min(ends) - max(starts), printed when it exceeds half of the file-1 range:
awk 'FNR==NR { f1[$0]; next }
{
    split($0, a, "-")                      # a: range from file 2
    for (e in f1) {
        split(e, b, "-")                   # b: range from file 1
        # overlap length = min(ends) - max(starts); +0 forces numeric comparison
        ov = (a[2]+0 < b[2]+0 ? a[2] : b[2]) - (a[1]+0 > b[1]+0 ? a[1] : b[1])
        if (ov > (b[2] - b[1]) / 2)        # overlap > 50% of the file-1 range
            print b[1] "-" b[2], a[1] "-" a[2]
    }
}' f1 f2
Convert the data into a "fake" BED format and use bedtools intersect: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
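For example, a quick sketch of that conversion (assuming the two range files are named f1 and f2, as in the awk answer above; BED wants tab-separated columns):
awk -F- 'BEGIN{OFS="\t"} {print "chr1", $1, $2}' f1 > 1.bed
awk -F- 'BEGIN{OFS="\t"} {print "chr1", $1, $2}' f2 > 2.bed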
$ cat 1.bed
chr1 23241 24234
chr1 10023 12300
chr1 75432 82324
$ cat 2.bed
chr1 16722 17234
chr1 92000 94532
chr1 23600 25000
# sort both files
$ sort -k 1,1 -k2,2n 1.bed > 1.sort.bed
$ sort -k 1,1 -k2,2n 2.bed > 2.sort.bed
$ bedtools intersect -wa -wb -f 0.5 -a 1.sort.bed -b 2.sort.bed
chr1 23241 24234 chr1 23600 25000
You can parse the output and strip out the chr1 labels afterwards.
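For example, a small awk sketch that turns each intersect line back into the original range notation:
bedtools intersect -wa -wb -f 0.5 -a 1.sort.bed -b 2.sort.bed \
| awk '{print $2"-"$3, $5"-"$6}'
# prints: 23241-24234 23600-25000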
Obviously, bedtools is not a built-in bash program, but as you can see from the docs it has a huge number of options that will likely be useful as soon as your needs become more complicated.

vcf to ped format: redefine non-dbSNPs

When I convert a vcf file to ped format (with vcftools or with the vcf-to-ped converter of 1000G), I run into the problem that variants without a dbSNP ID get the base-pair position of the variant as their ID. Example of a couple of variants:
1 rs35819278 0 23333187
1 23348003 0 23348003
1 23381893 0 23381893
1 rs18325622 0 23402111
1 rs23333532 0 23408301
1 rs55531117 0 23810772
1 23910834 0 23910834
However, I would like the variants without a dbSNP ID to get the format "chr:basepairposition". So the example above would look like:
1 rs35819278 0 23333187
1 chr1:23348003 0 23348003
1 chr1:23381893 0 23381893
1 rs18325622 0 23402111
1 rs23333532 0 23408301
1 rs55531117 0 23810772
1 chr1:23910834 0 23910834
It would be great if anyone could tell me which command or script I have to use to change the 2nd column for the variants without a dbSNP ID.
Thanks!
This can be done with sed. Since tabs are involved, the exact syntax may vary a bit depending on what sed is installed on your system; the following should work for Linux:
sed 's/^\([0-9]*\)\t\([0-9]\)/\1\tchr\1:\2/' [.map filename] > [new filename]
This looks for lines starting with [number][tab][digit], and makes them start with [number][tab]chr[number]:[digit] instead, while leaving other lines unchanged.
OS X is a bit more painful (you'll need to use ctrl-V or [[:blank:]] to deal with the tab).
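A portable alternative sketch in awk, which sidesteps the tab-quoting issue entirely (input.map and output.map are placeholder names): if column 2 is purely numeric, i.e. not an rs ID, rewrite it as chr<col1>:<col2>.
awk 'BEGIN{FS=OFS="\t"} $2 ~ /^[0-9]+$/ { $2 = "chr" $1 ":" $2 } 1' input.map > output.map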
This can be done with plink2. You just need to use the --set-missing-var-ids option (https://www.cog-genomics.org/plink2/data#set_missing_var_ids) accordingly:
plink --vcf [filename] \
--keep-allele-order \
--vcf-idspace-to _ \
--double-id \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--set-missing-var-ids chr#:# \
--make-bed \
--out [prefix]
However, notice that multiple variants could be assigned the same ID using this method, and plink2 will not tolerate variants with the same ID. To learn more about converting VCF files to plink, the following resource has further insights: http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html
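A quick way to spot such ID collisions after conversion (assuming the standard .bim layout produced by --make-bed, where column 2 holds the variant ID):
# print any variant IDs that occur more than once
awk '{print $2}' [prefix].bim | sort | uniq -d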

`join` with -e "NA" parameter incorrectly fills "NA" into a non-empty field

I am encountering a weird issue with join in a script I've written.
I have two files, say:
File1.txt (1st field: cluster size; 2nd field: brain coordinates)
54285;-40,-64,-2
5446;-32,6,24
File2.txt (1st field: cluster index; 2nd field: z-value; 3rd field: brain coordinates)
2;7.59;-40,-64,-2
2;7.33;62,-60,14
1;5.78;-32,6,24
1;5.66;-50,16,34
Where I am joining on the last field, the brain coordinates.
When I use the command
join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" File1.txt File2.txt
I expect
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;5446;5.78;-32,6,24
1;NA;5.66;-50,16,34
But I get
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;NA;5.78;-32,6,24
1;NA;5.66;-50,16,34
Such that the cluster size is missing on row 3 (i.e., cluster size for cluster #1, "5446").
If I edit File2 to take out lines that don't have a match in File1, i.e.:
File2.txt
2;7.59;-40,-64,-2
1;5.78;-32,6,24
I get the expected output:
2;54285;7.59;-40,-64,-2
1;5446;5.78;-32,6,24
If I edit File2.txt like so, adding a line without a cluster-size value to cluster #1:
File2.txt
2;7.59;-40,-64,-2
1;5.78;-32,6,24
1;5.66;-50,16,34
I also get the expected output:
2;54285;7.59;-40,-64,-2
1;5446;5.78;-32,6,24
1;NA;5.66;-50,16,34
BUT, if I edit File2.txt like so, adding a line without a cluster-size value to cluster #2:
File2.txt
2;7.59;-40,-64,-2
2;7.33;62,-60,14
1;5.78;-32,6,24
Then I do not receive the expected output:
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;NA;5.78;-32,6,24
Can anyone give me any insight into why this is occurring? Have I done something wrong, or is there something quirky going on with join that I haven't been able to suss out from the man page?
Although alternative solutions for joining these files (that is, using different tools than join) are welcome, I am most interested in figuring out why the current command isn't working.
Input files to the join command must be sorted on the join fields. With unsorted input, join cannot pair lines correctly once the keys go out of order, which is why the cluster size went missing from some rows but not others.
Try this instead (note that this uses process substitution, which is a bashism)
join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" <(sort -k2,2 -t';' File1.txt)\
<(sort -k3,3 -t';' File2.txt)
1;5446;5.78;-32,6,24
2;54285;7.59;-40,-64,-2
1;NA;5.66;-50,16,34
2;NA;7.33;62,-60,14

using paste command in a loop

I am using Fedora, and bash, to do some text manipulation with my files. I am trying to combine a large number of files, each with two columns of data. From these files, I want to extract the data in the 2nd column and put it in a single file. Previously, I used the following script:
paste 0_0.dat 0_6.dat 0_12.dat | awk '{print $1, $2, $4}' >0.dat
But this gets painfully hard as the number of files grows -- I am trying to do this with 100 files. So I looked through the web to see if there's a simple way to achieve this, but came up empty-handed.
I'd like to invoke a 'for' loop, if possible -- for example,
for i in $(seq 0 6 600)
do
paste 0_0.dat | awk '{print $2}'>>0.dat
done
but this does not work, of course, with the paste command.
Please let me know if you have any recommendations on how to do what I'm trying to do ...
DATA FILE #1 looks like below (delimited by a space)
-180 0.00025432
-179 0.000309643
-178 0.000189226
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
DATA FILE #2
-180 0.0002352
-179 0.000423452
-178 0.00019304
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
First column goes from -180 to 180, with increment of 1.
DESIRED
(n is the # of data columns, i.e. the # of files)
-180 0.00025432 0.00025123 0.000235123 0.00023452 0.00023415 ... n
-179 0.000223432 0.0420504 0.2143450 0.002345123 0.00125235 ... n
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
179 0.002352534 ... n
180 0.001504992 ... n
Thanks,
join can get you your desired result.
join <(sort -r file1) <(sort -r file2)
Test:
[jaypal:~/Temp] cat file1
-180 0.00025432
-179 0.000309643
-178 0.000189226
[jaypal:~/Temp] cat file2
-180 0.0005524243
-179 0.0002424433
-178 0.0001833333
[jaypal:~/Temp] join <(sort -r file1) <(sort -r file2)
-180 0.00025432 0.0005524243
-179 0.000309643 0.0002424433
-178 0.000189226 0.0001833333
To do multiple files at once, note that join accepts only two files per invocation, so you have to chain it, reading the previous result from stdin:
join file1 file2 | join - file3 | join - file4
How about this:
paste "$#" | awk '{ printf("%s", $1);
for (i = 2; i < NF; i += 2)
printf(" %s", $i); printf "\n";
}'
This assumes that you don't run into a limit with paste (check how many open files it can have). The "$@" notation means 'all the arguments given, exactly as given'. The awk script simply prints $1 from each line of pasted output, followed by the even-numbered columns, followed by a newline. It doesn't validate that the odd-numbered columns all match; it would perhaps be sensible to do so, and you could code a vaguely similar loop in awk to check. It also doesn't check that the number of fields on a line is the same as on the previous line; that's another reasonable check. But this does the whole job in one pass over all the files - for an essentially arbitrary list of files.
I have 100 input files -- how do I use this code to open up these files?
You put my original answer in a script 'filter-data'; you invoke the script with the 101 file names generated by seq. The paste command pastes all 101 files together; the awk command selects the columns you are interested in.
filter-data $(seq --format="0_%g.dat" 0 6 600)
The seq command with that format will list 101 file names; these are the 101 files that will be pasted.
You could even do without the filter-data script:
paste $(seq --format="0_%g.dat" 0 6 600) | awk '{ printf("%s", $1);
for (i = 2; i <= NF; i += 2)
printf(" %s", $i); printf "\n";
}'
I'd probably go with the more general script as the main script, and if need be I'd create a 'one-liner' that invokes the main script with the specific set of arguments currently of interest.
The other key point which might be a stumbling block: paste is not limited to 2 files only; it can paste as many files as you can have open (give or take about 3).
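You can check that limit on your system with a bash builtin; the 'give or take about 3' accounts for stdin, stdout, and stderr:
ulimit -n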
Based on my assumptions (see the comments above), you don't need paste. Try this:
awk '{ arr[$1] = arr[$1] "\t" $2 }
END { for (x in arr) print x "\t" arr[x] }' *.txt \
| sort -n
Note that we just collect all of the values into an array keyed on the first field, appending each value to the entry for its $1 key. After all the data has been read, the END section prints out each key and its accumulated values. Change '\t' to ':' or '|', or ... shudder ',' if you need to. Change the *.txt to whatever your filespec is.
Be aware that all Unix command lines have limits on the number of filenames (and the length of the filenames, not the data inside) that can be processed in one invocation. Let us know if you get error messages about that.
The pipe to sort -n ensures that the data is sorted numerically by column 1; awk's for-in loop returns keys in no particular order.
With my test data, the output was
-180 0.000254321 0.000254322 0.000254323 0.00025432
-179 0.0003096431 0.0003096432 0.0003096433 0.000309643
-178 0.0001892261 0.0001892262 0.0001892263 0.000189226
178 0.0001892261 0.0001892262 0.0001892263 0.000189226
179 0.0003096431 0.0003096432 0.0003096433 0.000309643
180 0.000254321 0.000254322 0.000254323 0.00025432
Based on 4 files of input.
I hope this helps.
This might work for you:
echo *.dat | sed 's/\S*/<(cut -d" " -f2 &)/2g;s/^/paste /' | bash >all.dat
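For illustration, with the three files from the question the pipeline builds and feeds bash a command like this (the first file is kept whole, so it supplies the shared first column):
paste 0_0.dat <(cut -d" " -f2 0_6.dat) <(cut -d" " -f2 0_12.dat)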
