How to do cumulative sums across consecutive columns of a tab-separated file (UNIX environment) - bash

I have a tab-separated file that looks something like this:
Q8VYA50 210 69 2 8 3
Q8VYA50 208 69 1 2 8 3
Q9C8G30 316 182 4 4 7
P335430 657 98 1 10 7
What I would like to do is apply a cumulative sum from the 4th column up to NF, so that each of those columns holds the running sum up to and including itself, while the earlier columns keep their original values. The desired output would be:
Q8VYA50 210 69 2 10 13
Q8VYA50 208 69 1 3 11 14
Q9C8G30 316 182 4 8 15
P335430 657 98 1 11 18
I have tried different approaches with a sum inside an awk script, using a for-loop over the fields the cumulative sum must be applied to, but the result I get is wrong.
Is there some way to do this correctly in Unix (bash)? Thanks in advance!
@Inian This is one way I have tried:
gawk 'BEGIN {FS=OFS="\t"}
{
  for (i=4; i<=NF; i++) {
    sum[i] += $i
    print $1, $2, $3, $i
  }
}' "input_file"
Another way is to do it manually for every column: $4, $5+$4, $6+$5+$4, $7+$6+$5+$4, and so on, but I think that is a "seedy" method.

The following awk may help you here:
awk '{for(i=5;i<=NF;i++){$i+=$(i-1)}} 1' OFS="\t" Input_file
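This rewrites each field from the 5th onward by adding the previous, already-updated field, so every field from column 5 on becomes the running sum starting at column 4, while the first four columns pass through untouched. Your own attempt goes wrong because sum[i] accumulates down the rows (across records) rather than across the fields of one row, and it prints one output line per field. If the first summed column may vary, here is a sketch with a hypothetical start variable (not part of the original answer):
awk -v start=4 '{for(i=start+1;i<=NF;i++) $i+=$(i-1)} 1' OFS="\t" Input_file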

Related

Bash Awk: Median over windows with start and stop positions

I have a text file that looks like the one below. The first column is location, the second is position, and the third is value.
1 10 200
1 11 150
1 12 300
2 13 400
2 14 100
2 15 250
3 16 200
3 17 200
3 18 350
3 19 150
...
I would like to calculate the median of the value field over a certain window. For example, let's say a window size of 4 rows. Below is the expected result for the sample data above:
1 2 10 13 250
2 3 14 17 200
...
For every window (4 rows), report the first and last values (within the window) of the first column, the first and last values of the second column, and the median of the third column.
I have it partially working. The script below prints the last value of column 1 within each window, the last value of column 2, and the mean:
win=4
cat file.txt | awk -v win="$win" '{sum+=$3} (NR%win)==0 {print $1,$2,sum/win;sum=0}'
2 13 262.5
3 17 187.5
...
How do I get the initial positions within each window, and the median instead of the mean?
$ awk '{r=(NR-1)%4; a[r]=$3}
r==0{f1=$1; s1=$2}
r==3{asort(a); print f1,$1,s1,$2,(a[2]+a[3])/2; delete a}' file
1 2 10 13 250
2 3 14 17 200
Note that the delete is not strictly necessary here, since the values are overwritten at each window computation. You can also parameterize the window size, but then you need to handle odd/even sizes:
$ awk -v w=5 '{r=(NR-1)%w; a[r]=$3}
r==0{f1=$1; s1=$2}
r==(w-1){asort(a);
print f1,$1,s1,$2,(w%2?a[int(w/2)+1]:(a[w/2]+a[w/2+1])/2);
delete a}' file
1 2 10 14 200
2 3 15 19 200
Note that this doesn't handle a final window that is smaller than the full size.
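If the trailing partial window should still be reported, here is a sketch that also tracks how many rows are buffered (the n, l1, l2 bookkeeping variables are my additions; asort still requires gawk):
$ awk -v w=5 '{r=(NR-1)%w; a[r]=$3; n=r+1; l1=$1; l2=$2}
 r==0{f1=$1; s1=$2}
 r==(w-1){asort(a);
   print f1,l1,s1,l2,(w%2?a[int(w/2)+1]:(a[w/2]+a[w/2+1])/2);
   n=0; delete a}
 END{if(n){asort(a);
   print f1,l1,s1,l2,(n%2?a[int(n/2)+1]:(a[n/2]+a[n/2+1])/2)}}' file
In the END block, n is the number of leftover rows, so the same odd/even median logic is applied to just those n sorted values.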

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each file, divide it by a constant, subtract one result from the other, and output a new third file with the resulting values?
I want the output file to have the form
#outputheader
0 123/c-422/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script.
I have seen this question: subtract columns from different files with awk
But I don't know how to do this with awk by myself. Does anyone know how, or could you explain what is going on in the linked question so that I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
Output:
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc.
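To also emit the #outputheader line from the question and write the result to a third file, a sketch along the same lines (the header text and the file3 name are placeholders):
awk -v c=10 -v k=20 '
BEGIN {print "#outputheader"}  ;# emit the output header first
/^#/ {next}                    ;# skip both input file headers
FNR==NR {val[$1]=$2; next}     ;# remember file1 values keyed by column 1
$1 in val {print $1, (val[$1]/c - $2/k)}
' file1 file2 > file3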

Collapse sequential numbers to ranges in bash

I am trying to collapse sequential numbers to ranges in bash. For example, if my input file is
1
2
3
4
15
16
17
18
22
23
45
46
47
I want the output as:
1 4
15 18
22 23
45 47
How can I do this with awk or sed in a single-line command?
Thanks for any help!
$ awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' file
1 4
15 18
22 23
45 47
Explanation
NR==1{first=$1;last=$1;next}
On the first line, initialize the variables first and last and skip to next line.
$1 == last+1 {last=$1;next}
If this line continues in the sequence from the last, update last and jump to the next line.
print first,last;first=$1;last=first
If we get here, we have a break in the sequence. Print out the range for the last sequence and reinitialize the variables for a new sequence.
END{print first,last}
After we get to the end of the file, print the final sequence.
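If you would rather print a lone number by itself instead of as a one-element range, here is a small variant of the same idea (a sketch, not part of the original answer):
awk 'NR==1{first=last=$1; next}
 $1==last+1{last=$1; next}
 {print (first==last ? first : first" "last); first=last=$1}
 END{print (first==last ? first : first" "last)}' file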

Removing multiple block of lines of a text file in bash

Assume a text file with 40 lines of data. How can I remove lines 3 to 10, 13 to 20, 23 to 30, and 33 to 40, in place, using a bash script?
I already know how to remove lines 3 to 10 with sed, but I wonder if there is a way to do all the removing, in place, with a single command. I could use a for loop, but the problem is that with each iteration the remaining line numbers shift, which requires extra calculation of the line numbers to remove.
Here is an awk one-liner that works for your needs whether your file has 40 lines or 40k lines:
awk 'NR~/[12]$/' file
for example, with 50 lines:
kent$ seq 50|awk 'NR~/[12]$/'
1
2
11
12
21
22
31
32
41
42
sed -i '3,10d;13,20d;23,30d;33,40d' file
This might work for you (GNU sed):
sed '3~10,+7d' file
This deletes line 3 plus the following 7 lines, then repeats in steps of 10 (lines 13, 23, 33, ..., each with the 7 lines after it).
If the file was longer than 40 lines and you were only interested in the first 40 lines:
sed '41,$b;3~10,+7d' file
The first instruction tells sed to ignore lines 41 to end-of-file.
Could also be written:
sed '1,40{3~10,+7d}' file
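As a quick check with 45 lines (reusing the seq trick from above), only the first 40 lines are filtered and the rest pass through untouched:
$ seq 45 | sed '41,$b;3~10,+7d'
1
2
11
12
21
22
31
32
41
42
43
44
45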
@Kent's answer is the way to go for this particular case, but in general:
$ seq 50 | awk '{idx=(NR%10)} idx>=1 && idx<=2'
1
2
11
12
21
22
31
32
41
42
The above will work even if you want to select the 4th through 7th lines out of every 13, for example:
$ seq 50 | awk '{idx=(NR%13)} idx>=4 && idx<=7'
4
5
6
7
17
18
19
20
30
31
32
33
43
44
45
46
It's not constrained to N out of 10.
Or to select just the 3rd, 5th and 6th lines out of every 13:
$ seq 50 | awk 'BEGIN{split("3 5 6",tmp); for (i in tmp) tgt[tmp[i]]=1} tgt[NR%13]'
3
5
6
16
18
19
29
31
32
42
44
45
The point is: selecting ranges of lines is a job for awk, definitely not sed.
awk '{m=NR%10} !(m==0 || m>=3)' file > tmp && mv tmp file
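A generic form of the same idea keeps the first k lines out of every block of n (a sketch; the n and k variable names are my own):
awk -v n=10 -v k=2 '(NR-1)%n < k' file
With n=10 and k=2 this keeps lines 1, 2, 11, 12, 21, 22, ... just like the answers above.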

Need help formatting data

I need your help formatting my data. I have data like the sample below:
Ver1
12 45
Ver2
134 23
Ver3
2345 980
ver4
21 1
ver36
213141222 22
....
...etc
I need my data in the format below:
ver1 12 45
ver2 134 23
ver3 2345 980
ver4 21 1
etc.....
I also want the totals of columns 2 and 3 at the end of the output. I'm not sure how to script this (maybe awk can, but I'm not sure). If possible, please share a detailed answer so I can learn and understand it.
$ awk 'NR%2{printf $0" ";next;}
{col1+=$1; col2+=$2} 1;
END{print "TOTAL col1="col1, "col2="col2}' file
Ver1 12 45
Ver2 134 23
Ver3 2345 980
ver4 21 1
ver36 213141222 22
TOTAL col1=213143734 col2=1071
It merges every two lines, as in Kent's solution. It also sums the 1st and 2nd columns of the value lines into the col1 and col2 variables. Finally, it prints the totals in the END {} block.
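An alternative sketch does the line merging with paste, which joins each pair of lines with a tab, and leaves only the summing to awk (assuming the input strictly alternates name line, value line):
paste - - < file | awk '{print; c1+=$2; c2+=$3} END{print "TOTAL col1="c1, "col2="c2}'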
