I have an input file as below and need to do this conversion for every 3-column triplet: col1*0 + col2*1 + col3*2.
input.txt - all positive numbers, possibly with decimals; the real file has 1000s of columns.
0 0 0 1 0 0
0 1 0 0 0 1
0 0 1 0 0 0
I have the below gawk line that does that:
gawk '{for(i=1;i<=NF;i+=3)x=(x?x FS:"")(($(i+1))+($(i+2)*2));print x;x=y}' input.txt
0 0
1 2
2 0
Additionally, I need to check if the 3 numbers are all zeros; if they are, the conversion result should be -9.
Pseudo code:
if($i==0 && $(i+1)==0 && $(i+2)==0) {-9} else {$(i+1)+$(i+2)*2}
# or, since all numbers are positive:
if(($i+$(i+1)+$(i+2))==0) {-9} else {$(i+1)+$(i+2)*2}
Expected output:
-9 0
1 2
2 -9
Data description:
This data is output from IMPUTE2 - a genotype imputation and haplotype phasing program. Rows are SNPs, columns are samples. Every SNP is represented by 3 columns: 3 numbers per SNP in the range 0-1 (the probabilities of genotypes AA, AB, BB). So in the above example we have 3 SNPs and 2 samples. Imputation can also be represented as a dosage value: 1 number per SNP in the range 0-2. We are trying to convert the probability format into the dosage format. When IMPUTE2 can't assign any probability to any of the genotypes, it outputs 0 0 0, which we should convert to the no-call value -9.
You want the sum to be different if the three given columns are all 0. For this, you can expand the ternary operator to something like:
gawk '{
  for (i=1; i<=NF; i+=3) {
    x = $(i+1) + $(i+2)*2    # the sum
    res = res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ? -9 : x)
  }
  print res; res=""          # print the stored line and empty it for the next one
}' file
That is, append the value -9 if all three elements are 0; otherwise, append the calculated x:
res = res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ? -9 : x)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^
                           if three columns are 0............|
Since all values are non-negative, the check can be simplified to just test whether the sum of the three columns is 0 or not:
($i + $(i+1) + $(i+2)) ? x : -9
Testing with your file apparently works:
$ gawk '{for(i=1;i<=NF;i+=3) {x=$(i+1) + $(i+2)*2; res=res (res ? FS : "") ($i==0 && $(i+1)==0 && $(i+2)==0 ?-9:x)} print res; res=""}' file
-9 0
1 2
2 -9
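Combining that simplified check with the loop above gives an equivalent, shorter command (valid because the input values are non-negative):
$ gawk '{for(i=1;i<=NF;i+=3){x=$(i+1)+$(i+2)*2; res=res (res?FS:"") (($i+$(i+1)+$(i+2))?x:-9)} print res; res=""}' input.txt
-9 0
1 2
2 -9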
another awk one-liner (assuming non-negative input values; note it is hard-coded for the two triplets of the sample):
$ awk '{c1=$2+2*$3; c2=$5+2*$6; print c1||$1?c1:-9, c2||$4?c2:-9}' input.txt
-9 0
1 2
2 -9
Related
I want to write a shell script which can increment the last value of the first line of a file with a certain structure:
File-structure:
p cnf integer integer
integer integer ... 0
For Example:
p cnf 11 9
1 -2 0
3 -1 5 0
To:
p cnf 11 10
1 -2 0
3 -1 5 0
The dots should stay the same.
If you could use perl:
perl -pe 's/(-?\d+)$/$1+1/e if $. == 1' inputfile
Here (-?\d+)$ captures the integer value (optionally negative) at the end of the line, and the e flag evaluates the replacement as Perl code, so the value is incremented. The $. == 1 condition restricts the substitution to the first line.
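For example:
$ perl -pe 's/(-?\d+)$/$1+1/e if $. == 1' inputfile
p cnf 11 10
1 -2 0
3 -1 5 0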
With GNU awk:
awk 'NR==1{$NF++} {print}' file
or
awk 'NR==1{$NF++}1' file
Output:
p cnf 11 10
1 -2 0
3 -1 5 0
$NF contains the last column.
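For instance, applied to just the first line:
$ echo 'p cnf 11 9' | awk '{$NF++} 1'
p cnf 11 10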
I am using bash to process software responses on the fly, and I am looking for a way to find the index of the maximum element in an array.
The data that gets fed to the bash script is like this:
25 9
72 0
3 3
0 4
0 7
So I create two arrays:
arr1 = [ 25 72 3 0 0 ]
arr2 = [ 9 0 3 4 7 ]
And what I need is to find the index of the maximum number in arr1 in order to use it also for arr2.
But I would like to see if there is a quick, optimal way to do this.
Would it maybe be better to use a dictionary structure [key][value] with the data I have? Would this make the process easier?
I have also found [1] (from user jhnc) but I don't quite think it is what I want.
My brute-force approach is the following:
function MAX {
    arr1=( 25 72 3 0 0 )
    arr2=( 9 0 3 4 7 )
    local indx=0
    local max=${arr1[0]}
    local flag
    for ((i=1; i<${#arr1[@]}; i++)); do
        # To avoid invalid arithmetic operators when items are floats/doubles
        flag=$( python <<< "print(${arr1[${i}]} > ${max})" )
        if [ "$flag" == "True" ]; then
            indx=${i}
            max=${arr1[${i}]}
        fi
    done
    echo "MAX:INDEX = ${max}:${indx}"
    echo "${arr1[${indx}]}"
    echo "${arr2[${indx}]}"
}
This approach obviously works, BUT is it the optimal one? Is there a faster way to perform the task?
arr1 = [ 99.97 0.01 0.01 0.01 0 ]
arr2 = [ 0 6 4 3 2 ]
In this example, if an array contains floats then I would get a
syntax error: invalid arithmetic operator (error token is ".97)
So I am using
flag=$( python <<< "print(${arr1[${i}]} > ${max})" )
in order to overcome this issue.
Finding a maximum is inherently an O(n) operation. But there's no need to spawn a Python process on each iteration to perform the comparison. Write a single awk script instead.
awk 'BEGIN {
    split(ARGV[1], a1);
    split(ARGV[2], a2);
    max = a1[1];
    indx = 1;
    for (i in a1) {
        if (a1[i] > max) {
            indx = i;
            max = a1[i];
        }
    }
    print "MAX:INDEX = " max ":" (indx - 1)
    print a1[indx]
    print a2[indx]
}' "${arr1[*]}" "${arr2[*]}"
The two shell arrays are passed as space-separated strings to awk, which splits them back into awk arrays.
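With the sample arrays, this prints (indx - 1 converts awk's 1-based index back to bash's 0-based numbering):
MAX:INDEX = 72:1
72
0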
It's difficult to do this efficiently if you really do need to compare floats. Bash can't do floats, which means invoking an external program for every number comparison. However, comparing every number in bash is not necessarily needed.
Here is a fast, pure-bash, integer-only solution using arithmetic comparison:
#!/bin/bash
arr1=( 25 72 3 0 0 )
arr2=( 9 0 3 4 7 )

# Get the maximum, and also save its index(es)
for i in "${!arr1[@]}"; do
    if ((arr1[i] > arr1_max)); then
        arr1_max=${arr1[i]}
        max_indexes=($i)
    elif [[ "${arr1[i]}" == "$arr1_max" ]]; then
        max_indexes+=($i)
    fi
done

# Print the results
printf '%s\n' \
    "Array1 max is $arr1_max" \
    "The index(s) of the maximum are:" \
    "${max_indexes[@]}" \
    "The corresponding values from array 2 are:"
for i in "${max_indexes[@]}"; do
    echo "${arr2[i]}"
done
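With the sample arrays, this prints:
Array1 max is 72
The index(s) of the maximum are:
1
The corresponding values from array 2 are:
0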
Here is another method, which can also handle floats. Comparison in bash is avoided altogether: instead, the much faster sort(1) is used, and only once, rather than starting a new python instance for every number.
#!/bin/bash
arr1=( 25 72 3 0 0 )
arr2=( 9 0 3 4 7 )

arr1_max=$(printf '%s\n' "${arr1[@]}" | sort -n | tail -1)

for i in "${!arr1[@]}"; do
    [[ "${arr1[i]}" == "$arr1_max" ]] &&
        max_indexes+=($i)
done

# Print the results
printf '%s\n' \
    "Array 1 max is $arr1_max" \
    "The index(s) of the maximum are:" \
    "${max_indexes[@]}" \
    "The corresponding values from array 2 are:"
for i in "${max_indexes[@]}"; do
    echo "${arr2[i]}"
done
Example output:
Array 1 max is 72
The index(s) of the maximum are:
1
The corresponding values from array 2 are:
0
Unless you need those arrays, you can also feed the output of your input script directly into something like this:
#!/bin/bash
input-script |
sort -nr |
awk '
(NR==1) {print "Max: "$1"\nCorresponding numbers:"; max = $1}
{if (max == $1) print $2; else exit}'
Example (with some extra numbers):
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
sort -nr |
awk '(NR==1) {max = $1; print "Max: "$1"\nCorresponding numbers:"}
{if (max == $1) print $2; else exit}'
Max: 72
Corresponding numbers:
4
11
0
You can also do it 100% in awk (GNU awk, for asort), including the sorting:
$ echo \
'25 9
72 0
72 11
72 4
3 3
3 14
0 4
0 1
0 7' |
awk '
{
    col1[NR] = $1
    line[NR] = $0
}
END {
    n = asort(col1)           # GNU awk: sorts the values, reindexes 1..n
    col1_max = col1[n]        # the largest value is now last
    print "Max is "col1_max"\nCorresponding numbers are:"
    for (i in line) {
        split(line[i], f)
        if (f[1] == col1_max)
            print f[2]
    }
}'
Max is 72
Corresponding numbers are:
0
11
4
Or, as simply as possible, just to get the maximum of column 1 and any single number from column 2 that corresponds with it:
$ echo \
'25 9
72 0
3 3
0 4
0 7' |
sort -nr |
head -1
72 0
My file looks like this
0 0 1 0.2 1 1
1 1 0.8 0.1 1
0.2 0.4 1 0 1
And I need to create a new output file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
i.e. if the number is greater than 0.5, it is rounded up to 1, and if it is less than 0.5, it is rounded down to 0, and the result is put into a new file.
The file is quite large, with ~ 1400000000 values. I would quite like to write a bash script to do this.
I am guessing the best way to do this would be to iterate over each value in a for loop, with an if statement inside that tests whether the number is greater or less than 0.5 and prints 0 or 1 accordingly.
The pseudocode would look like this, but my bash isn't great, so before you tell me it isn't syntactically correct, I already know:
#!/bin/bash
# reads in each line
while read -r p; do
    # loops through each number in the line
    for i in $p; do
        # tests if each number is greater than or equal to 0.5 and prints accordingly
        # (bc does the float comparison, since bash arithmetic is integer-only)
        if [ "$(echo "$i >= 0.5" | bc)" -eq 1 ]; then
            printf '1 '
        else
            printf '0 '
        fi
    done
    echo
done < test.txt > output.txt
I'm not really sure how to do this. Can anyone help? Thanks.
awk '{
for( i=1; i<=NF; i++ )
$i = $i<0.5 ? 0 : 1
}1' input_file > output_file
The assignment $i = $i<0.5 ? 0 : 1 changes each field to 0 or 1, and the 1 after {...} prints the line with the changed values.
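Running it against the sample input reproduces the expected output:
$ awk '{for( i=1; i<=NF; i++ ) $i = $i<0.5 ? 0 : 1}1' input_file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1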
another awk without loops (GNU awk, for the regex RS and the RT variable)...
$ awk -v RS='[ \n]' '{printf ($1>=0.5) RT}' file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
if the values are not between 0 and 1, you may want to change to
$ awk -v RS='[ \n]' '{printf "%.0f%s", $1, RT}' file
note that the default rounding is round-half-to-even (i.e. 0.5 -> 0, but 1.5 -> 2). If you want to always round halves up:
$ awk -v RS='[ \n]' '{i=int($1); printf "%d%s", i+(($1-i)>=0.5), RT}' file
should take care of non-negative numbers. For negatives, there are again two alternatives: round towards zero or towards negative infinity.
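One way to extend the round-half-up variant so that negative halves go towards zero is floor($1+0.5); a minimal sketch, again assuming GNU awk for RT:
$ awk -v RS='[ \n]' '{v=$1+0.5; f=int(v); if (v<0 && f!=v) f--; printf "%d%s", f, RT}' file
The f-- turns int()'s truncation towards zero into a proper floor for negative non-integers, so e.g. -0.7 becomes -1 while -0.5 becomes 0.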
Here's one in Perl using regex and look-ahead:
$ perl -p -e 's/0(?=\.[6789])/1/g;s/\.[0-9]+//g' file
0 0 1 0 1 1
1 1 1 0 1
0 0 1 0 1
I went with the "if it is less than 0.5, it is rounded down to 0" part.
I have a large file that I would like to break into chunks by field 2. Field 2 ranges in value from about 0 to about 250 million.
1 10492 rs55998931 C T 6 7 3 3 - 0.272727272727273 0.4375
1 13418 . G A 6 1 2 3 DDX11L1 0.25 0.0625
1 13752 . T C 4 4 1 3 DDX11L1 0.153846153846154 0.25
1 13813 . T G 1 4 0 1 DDX11L1 0.0357142857142857 0.2
1 13838 rs200683566 C T 1 4 0 1 DDX11L1 0.0357142857142857 0.2
I want field 2 to be broken up into intervals of 50,000, but overlapping by 2,000. For example, the first three awk commands would look like:
awk '$1=="1" && $2>=0 && $2<=50000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.0kb.50kb
awk '$1=="1" && $2>=48000 && $2<=98000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.48kb.98kb
awk '$1=="1" && $2>=96000 && $2<=146000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.96kb.146kb
I know that there's a way I can do this using a for loop with variables like i and j. Can someone help me out?
awk '$1=="1"{n=int($2/48000); print>("chr1." (48*n) "kb." (48*n+50) "kb");n--; if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb");}' Highalt.Lowalt.allelecounts.filteredformissing.freq
Or spread out over multiple lines:
awk '$1=="1"{
n=int($2/48000)
print>("chr1." (48*n) "kb." (48*n+50) "kb")
n--
if (n>=0 && $2/1000<=48*n+50)
print>("chr1." (48*n) "kb." (48*n+50) "kb")
}' Highalt.Lowalt.allelecounts.filteredformissing.freq
How it works
$1=="1"{
This selects all lines whose first field is 1. (You didn't mention this restriction in the text, but your code applied it.)
n=int($2/48000)
This computes which bucket the line belongs in.
print>("chr1." (48*n) "kb." (48*n+50) "kb")
This writes the line to the appropriate file.
n--
This decrements the bucket number.
if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb")
If this line also fits within the overlapping range of the previous bucket, then write it to that bucket also.
}
This closes the group started by selecting $1=="1".
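As a worked example, a line with $2=49000 gets n=int(49000/48000)=1 and is written to chr1.48kb.98kb; after n--, the overlap check 49 <= 48*0+50 holds, so the line is also written to chr1.0kb.50kb. A line with $2=60000 also gets n=1, but 60 > 50, so it lands only in chr1.48kb.98kb.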
I have a file like this
file.txt
0 1 a
1 1 b
2 1 d
3 1 d
4 2 g
5 2 a
6 3 b
7 3 d
8 4 d
9 5 g
10 5 g
.
.
.
I want to reset the row number count in the first column $1 to 0 whenever the value in the second column $2 changes, using awk or a bash script.
result
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
.
.
.
As long as you don't mind a bit of excess memory usage, and the second column is sorted, I think this is the most fun:
awk '{$1=a[$2]+++0; print}' input.txt
The cryptic a[$2]+++0 parses as (a[$2]++) + 0: a separate counter per value of column 2, post-incremented on each line (the +0 forces a numeric result).
This awk one-liner seems to work for me:
[ghoti@pc ~]$ awk 'prev!=$2{first=0;prev=$2} {$1=first;first++} 1' input.txt
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
Let's break apart the script and see what it does.
prev!=$2 {first=0;prev=$2} -- This is what resets your counter. Since the initial state of prev is empty, we reset on the first line of input, which is fine.
{$1=first;first++} -- For every line, set the first field, then increment variable we're using to set the first field.
1 -- this is awk short-hand for "print the line". It's really a condition that always evaluates to "true", and when a condition/statement pair is missing a statement, the statement defaults to "print".
Pretty basic, really.
The one catch of course is that when you change the value of any field in awk, it rewrites the line using whatever field separators are set, which by default is just a space. If you want to adjust this, you can set your OFS variable:
[ghoti@pc ~]$ awk -vOFS=" " 'p!=$2{f=0;p=$2}{$1=f;f++}1' input.txt | head -2
0 1 a
1 1 b
Salt to taste.
A pure bash solution:
file="/PATH/TO/YOUR/OWN/INPUT/FILE"
count=0
old_trigger=0

while read -r a b c; do
    if ((b == old_trigger)); then
        echo "$((count++)) $b $c"
    else
        count=0
        echo "$((count++)) $b $c"
        old_trigger=$b
    fi
done < "$file"
This solution has (IMHO) the advantage of a readable algorithm. I like the other answers, but they are not as easy for beginners to follow.
NOTE:
((...)) is an arithmetic command, which returns an exit status of 0 if the expression is nonzero, or 1 if the expression is zero. Also used as a synonym for let, if side effects (assignments) are needed. See http://mywiki.wooledge.org/ArithmeticExpression
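For example:
$ ((5)); echo $?
0
$ ((0)); echo $?
1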
Perl solution:
perl -naE '
    $dec = $F[0] if defined $old and $F[1] != $old;
    $F[0] -= $dec;
    $old = $F[1];
    say join "\t", @F[0,1,2];'
$dec is subtracted from the first column on each line. When the second column changes (its previous value is stored in $old), $dec is updated so that the first column becomes zero again. The defined condition is needed for the first line to work.
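For example (the output columns are joined with tabs):
$ perl -naE '$dec = $F[0] if defined $old and $F[1] != $old; $F[0] -= $dec; $old = $F[1]; say join "\t", @F[0,1,2];' file.txt
0	1	a
1	1	b
2	1	d
3	1	d
0	2	g
1	2	a
and so on, matching the expected result above.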