losing data when comparing a column with awk - bash

I have a text file and all I want to do is check whether the third column is equal to 1 or 0, so I simply used
awk '$3 == 1 { print $0 }' input > output1
awk '$3 == 0 { print $0 }' input > output2
This is part of a bash script, and I'm certain there is a more elegant approach, but the code above should get the job done. Only it does not: input has 425 rows, and the third column in input is always a 1 or a 0, so the total number of rows in output1 plus output2 should be 425. But I get 417 rows.
Here is a sample row of input (each record is a single row like this, and there are 425 of them):
out_first.dat 1 1 0.000000 265075.000000 6.000000e-01 1.005205e-03 9.000000e-01 9.000000e-01 2.889631e+00 -2.423452e+00 3.730018e+00 -1.532915e+00

If $3 is 1 or 0, it is equal to its own square, so the line is printed to output1 or output2; if not, it is printed to other for inspection.
awk '$3*$3==$3{print > ("output" (2-$3)); next} {print > "other"}' file
If $3*$3==$3 is confusing, change it to $3==0 || $3==1.
For the curious: $3==0 || $3==1 can be written as $3*($3-1)==0, from which the above follows.
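To see which rows are slipping through (the missing 425 - 417 = 8 lines), a quick diagnostic along these lines should surface them (a sketch):
awk '$3 != 0 && $3 != 1 { print NR": "$0 }' input
Any line it prints has a third column that is neither 0 nor 1 (stray characters, a different value, or a shifted field), which would explain the shortfall.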


Awk if else with conditions

I am trying to make a script (with a loop) to extract matching lines and print them into a new file. There are two conditions: first, I need to print the values of the 2nd and 4th columns of the map file whenever the 2nd column of the map file matches the 4th column of the test file. Second, when there is no match, I want to print the value of the 2nd column of the test file with a zero as the second output column.
My test file is made this way:
8 8:190568 0 190568
8 8:194947 0 194947
8 8:197042 0 197042
8 8:212894 0 212894
My map file is made this way:
8 190568 0.431475 0.009489
8 194947 0.434984 0.009707
8 19056880 0.395066 112.871160
8 101908687 0.643861 112.872348
1st attempt:
for chr in {21..22};
do
awk 'NR==FNR{a[$2]; next} {if ($4 in a) print $2, $4 in a; else print $2, $4 == "0"}' map_chr$chr.txt test_chr$chr.bim > position.$chr;
done
Result:
8:190568 1
8:194947 1
8:197042 0
8:212894 0
My second script is:
for chr in {21..22}; do
awk 'NR == FNR { ++a[$4]; next }
$4 in a { print a[$2], $4; ++found[$2] }
END { for(k in a) if (!found[k]) print a[k], 0 }' \
"test_chr$chr.bim" "map_chr$chr.txt" >> "position.$chr"
done
And the result is:
1 0
1 0
1 0
1 0
The result I need is:
8:190568 0.009489
8:194947 0.009707
8:197042 0
8:212894 0
This awk should work for you:
awk 'FNR==NR {map[$2]=$4; next} {print $4, map[$4]+0}' mapfile testfile
190568 0.009489
194947 0.009707
197042 0
212894 0
This awk command processes mapfile first and stores $2 as the key and $4 as the value in an associative array named map.
When it later processes testfile in the second block, it prints $4 from the second file together with the value stored in map under key $4. Adding 0 to the stored value makes sure that we get 0 when $4 is not present in map.
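If you want the output keyed the way the desired result shows (the 8:190568 form), a one-character tweak should do it, assuming the 2nd column of the test file carries that id:
awk 'FNR==NR {map[$2]=$4; next} {print $2, map[$4]+0}' mapfile testfile
On the samples above this should give:
8:190568 0.009489
8:194947 0.009707
8:197042 0
8:212894 0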

joining 2 file and taking first file as priority

I'm looking for help on joining (at the UNIX level) two files (file1 and file2), taking values from file1 in priority over the values in file2.
If a srcValue exists in file1, it should be taken instead of file2's tmpValue. If there is no srcValue in file1, then pick up the value from file2's tmpValue.
Sample data:
file1:
id name srcValue
1 a s123
2 b s456
3 c
file2:
id tmpValue
1 Tva
3 TVb
4 Tvm
Desired output:
ID Name FinalValue
1 a s123
2 b s456
3 c TVb
I would approach this problem with an awk script; it is fairly powerful and flexible. The general approach here is to load the values from file2 first, then loop through file1 and substitute them as needed.
awk 'BEGIN { print "ID Name FinalValue" }
     FNR == NR && FNR > 1 { tmpValue[$1] = $2 }
     FNR != NR && FNR > 1 {
         if (NF == 2) {
             print $1, $2, tmpValue[$1]
         } else {
             print $1, $2, $3
         }
     }' file2 file1
The BEGIN block is executed before any files are read; its only job is to output the new header.
The FNR == NR && FNR > 1 condition is true for the first filename ("file2" here) and also skips the first line of that file (FNR > 1), since it's a header line. The "action" block for that condition simply fills an associative array with the id and tmpValue from file2.
The FNR != NR && FNR > 1 corresponds to the second filename ("file1" here) and also skips the first (header) line. In this block of code, we check to see if there's a srcValue; if so, print those three values back out; if not, substitute in the saved value (assuming there is one; otherwise, it'll be blank).
This assumes that column 3 in file1 is actually empty when there is no srcValue, as in the sample above (the <br> bits in the original question looked like attempts at formatting).
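The same logic can be written a bit more compactly with a ternary; this is an equivalent sketch, not a change in behavior:
awk 'BEGIN { print "ID Name FinalValue" }
     FNR == NR && FNR > 1 { tmpValue[$1] = $2; next }
     FNR > 1 { print $1, $2, (NF == 3 ? $3 : tmpValue[$1]) }' file2 file1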

piping commands of awk and sed is too slow! any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently I am using the following code, but it is very slow, so I wanted to ask if anybody could come up with a quicker way.
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for each unique scaffold, with each pass rescanning the whole input, is bound to be slow. It's usually better to find a way to process all the lines in one call.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END { print " $F[1]" }
             next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
             print " $p[1]\n@F";
            } continue { @p = @F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (which might be unnecessary if your input is already sorted that way). Perl then reads the lines; -a splits each line into the @F array, and the @p array keeps the previous line. If the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the start of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
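For comparison, here is the same single-pass idea as an awk sketch, assuming the input is sorted by scaffold and then by site, as in the sample:
awk 'NR == 1 { next }                    # skip the SCAFF SITE header
     $1 != scaff || $2 != last + 1 {     # new scaffold, or a gap in the sites
         if (NR > 2) print scaff, first, last
         scaff = $1; first = $2
     }
     { last = $2 }
     END { if (NR > 1) print scaff, first, last }' indiv.txt > indiv.bed
Like the Perl version, it keeps only the previous site and emits a range whenever the run of consecutive sites breaks.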

Pattern matches by row name with multiple columns in multiple files

I have a file with a full list of gene names and three others with partial lists of gene names. I want to merge these files into one. The partial files have different numbers of rows but all have 3000 columns, each representing a different cell. I have been trying to join these files, but when I use awk only one column is kept.
mergedAll.txt
GENE
SOX2
BRCA1
BRCA2
RHO
ultimatecontrolMed.txt
GENE CELL1 CELL2 CELL3
SOX2 30 152 2000
BRCA2 400 234 73
RHO 12 2 0
My desired output would be:
GENE CELL1 CELL2 CELL3
SOX2 30 152 2000
BRCA1 0 0 0
BRCA2 400 234 73
RHO 12 2 0
I run:
awk 'NR==FNR{k[$1];next}{b[$1]=$0;k[$1]}
END{for(x in k)
if ( x== "GENE" )
printf"%s %s\n",x,b[x]
else
printf"%s %d\n",x,b[x]
}' mergedAll.txt ultimatecontrolMed.txt > test.txt
And I get:
GENE CELL1 CELL2 CELL3
SOX2 2000
BRCA1 0
BRCA2 73
RHO 0
For some reason it keeps the last column of counts but none of the other columns, while keeping all the cell names. I don't have any experience with awk, so this has been a major challenge for me; I would love it if someone could offer a better solution.
awk to the rescue!
$ awk 'NR==FNR {a[$1]=$0; next}
{print (a[$1]?a[$1]:($1 FS 0 FS 0 FS 0))}' file2 file1 |
column -t
GENE CELL1 CELL2 CELL3
SOX2 30 152 2000
BRCA1 0 0 0
BRCA2 400 234 73
RHO 12 2 0
The final pipe to column is for pretty-printing. Note the order of the files.
To avoid hard-coding the number of columns, you can try this alternative:
$ awk 'NR==1 {for(i=2;i<=NF;i++) missing=missing FS 0}
NR==FNR {a[$1]=$0; next}
{print (a[$1]?a[$1]:($1 missing))}' file2 file1
Could you please try the following awk and let me know if it helps.
awk 'FNR==NR{a[$0];next} ($1 in a){print;delete a[$1];next} END{for(i in a){print i,"0 0 0"}}' mergedAll.txt ultimatecontrolMed.txt
The problem is that you're printing b[x] with %d format. That's for printing a single integer, so it will ignore all the other integers in b[x]. Change
printf"%s %d\n",x,b[x]
to:
if (b[x]) {
    printf "%s\t%s\n", x, b[x]
} else {
    printf "%s", x
    for (i = 0; i < 3000; i++) printf "\t0"
    print ""
}
so that it will print the entire value. If there's no corresponding value, it will print zeroes.
Replace 3000 with the appropriate number of cells. If you don't want to hard-code it, you can get it from NF-1 when FNR == 1 && FNR != NR (the first line of the second file).
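Putting those fixes together, a corrected version of the original script might look like this. This is a sketch that keeps the structure of your script; note that for (x in k) visits keys in no particular order, so the header line is not guaranteed to come out first:
awk 'NR == FNR { k[$1]; next }        # first file: collect the full gene list
     FNR == 1  { ncells = NF - 1 }    # second file: cell count from its header
     { b[$1] = $0 }                   # store each gene's full row
     END {
         for (x in k) {
             if (x in b) print b[x]
             else {
                 line = x
                 for (i = 0; i < ncells; i++) line = line " 0"
                 print line
             }
         }
     }' mergedAll.txt ultimatecontrolMed.txt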
join -a 1 -a 2 -e 0 -o 0 2.{2..4} mergedAll.txt ultimatecontrolMed.txt
2.{2..4} prints a list of output fields and can easily be adapted to any number of fields.
As you mention three input files, it would be possible to pipe the result of a first join into a second one
join .... file1 file2 | join ... file3
join needs sorted input. That may be a deal-breaker for this solution.

Display only lines in which 1 column is equal to another, and a second column is in a range in AWK and Bash

I have two files. The first file looks like this:
1 174392
1 230402
2 4933400
3 39322
4 42390021
5 80022392
6 3818110
and so on
the second file looks like this:
chr1 23987 137011
chr1 220320 439292
chr2 220320 439292
chr2 2389328 3293292
chr3 392329 398191
chr4 421212 3292393
and so on.
I want to return the whole line from FILE2, provided that the first column in FILE1 equals the first column in FILE2 (as a string match) AND the 2nd column in FILE1 is greater than column 2 in FILE2 but less than column 3 in FILE2.
So in the above example, the line
1 230402
in FILE1 and
chr1 220320 439292
in FILE2 would satisfy the conditions, because 230402 is between 220320 and 439292, and 1 equals chr1 once I make the strings match; therefore that line in FILE2 would be printed.
The code I wrote was this:
#!/bin/bash
$F1="FILE1.txt"
read COL1 COL2
do
grep -w "chr$COL1" FILE2.tsv \
| awk -v C2=$COL2 '{if (C2>$1 && C2<$2); print $0}'
done < "$F1"
I have tried many variations of this. I do not care if the code is entirely in awk, entirely in bash, or a mixture.
Can anyone help?
Thank you!
Here is one way using awk:
awk '
NR==FNR {
    $1 = "chr" $1
    seq[$1,$2]++
    next
}
{
    for (key in seq) {
        split(key, tmp, SUBSEP)
        if (tmp[1] == $1 && $2 <= tmp[2] && tmp[2] <= $3) {
            print $0
        }
    }
}' file1 file2
chr1 220320 439292
We read the first file into an array, using columns 1 and 2 as the key. We prepend the string "chr" to column 1 while making the key, for easy comparison later on.
When we process file 2, we iterate over our array and split the key.
We compare the first piece of the key to column 1 and check whether the second piece is in the range of the second and third columns.
If it satisfies our condition, we print the line.
awk 'BEGIN { i = 0 }
     FNR == NR { chr[i] = "chr" $1; test[i++] = $2 }
     FNR < NR { for (c in chr) {
                    if ($1 == chr[c] && test[c] > $2 && test[c] < $3) { print }
                }
              }' FILE1.txt FILE2.tsv
FNR is the line number within the current file, NR is the line number within all the input. So the first block processes the first file, collecting all the lines into arrays. The second block processes any remaining files, searching through the array of chrN values looking for a match, and comparing the other two numbers to the number from the first file.
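Both approaches scan the whole FILE1 array for every line of FILE2. A variant that groups positions by chromosome skips most of those comparisons (a sketch; the break also makes it print each FILE2 line at most once, even if several positions fall inside the range):
awk 'FNR == NR { pos["chr" $1, ++n["chr" $1]] = $2; next }
     { for (i = 1; i <= n[$1]; i++)
           if (pos[$1, i] > $2 && pos[$1, i] < $3) { print; break }
     }' FILE1.txt FILE2.tsv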
Thanks very much!
These answers work and are very helpful.
Also, at long last I realized I should have had:
awk -v C2=$COL2 '{if (C2>$1 && C2<$2) print $0}'
with the braces in a different place, and I would have been fine.
At any rate, thank you very much!
