Sum of a column till certain count - bash

I have a file ABC.txt which contains two columns. The first column is a count and the second column is a subscriber number, as below:
1852 919474214491
1558 919475591746
1149 919475594574
1 919466423350
I have a variable in a script that holds a numeric value, i.e. count=3500.
I want to compare a running total of the first column in ABC.txt with the variable. If the value in the first column is less than the variable, write the value in the second column to a separate file (123.txt). Then go to the next row, add 1852 and 1558, and compare the sum with the variable again; if it is still less than the variable, write that row's second column to 123.txt as well. But as soon as the running sum exceeds the variable, stop.

Really easy to do with awk:
$ awk -v count=3500 '{ total += $1 } total >= count { exit } { print $2 }' ABC.txt
919474214491
919475591746
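If, as in the question, the selected subscribers should end up in 123.txt rather than on the terminal, redirecting the output is enough. A minimal sketch, assuming the threshold lives in a shell variable named count:

#!/bin/bash
count=3500    # the numeric value from the question
awk -v count="$count" '{ total += $1 } total >= count { exit } { print $2 }' ABC.txt > 123.txt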

Related

Take mean of columns in text file by every 10-row blocks in bash

I have a tab-delimited text file with two columns and no header. I want to take the mean of each column within blocks of 10 rows: take the first 10 rows, compute the mean of each column over those 10 numbers, and write the means to another text file. Then continue with the next 10 rows, and so on until the end of the file. If fewer than 10 rows are left at the end, just take the mean of the remaining rows.
Input file:
0.32832977 3.50941E-10
0.31647876 3.38274E-10
0.31482627 3.36508E-10
0.31447645 3.36134E-10
0.31447645 3.36134E-10
0.31396809 3.35591E-10
0.31281157 3.34354E-10
0.312004 3.33491E-10
0.31102326 3.32443E-10
0.30771822 3.2891E-10
0.30560062 3.26647E-10
0.30413213 3.25077E-10
0.30373717 3.24655E-10
0.29636685 3.16777E-10
0.29622422 3.16625E-10
0.29590765 3.16286E-10
0.2949896 3.15305E-10
0.29414582 3.14403E-10
0.28841901 3.08282E-10
0.28820667 3.08055E-10
0.28291832 3.02403E-10
0.28243792 3.01889E-10
0.28156429 3.00955E-10
0.28043638 2.9975E-10
0.27872239 2.97918E-10
0.27833349 2.97502E-10
0.27825573 2.97419E-10
0.27669023 2.95746E-10
0.27645657 2.95496E-10
Expected output text file:
0.314611284 3.36278E-10
0.296772974 3.172112E-10
0.279535036 2.987864E-10
I tried this code, but I don't know how to restart the calculation at every 10th row:
awk '{x+=$1;next}END{print x/NR}' file
Here is an awk to do this:
awk -v m=10 -v OFS="\t" '
FNR%m==1{sum1=0;sum2=0}
{sum1+=$1;sum2+=$2}
FNR%m==0{print sum1/m,sum2/m; lfnr=FNR; next}
END{if(FNR>lfnr) print sum1/(FNR-lfnr),sum2/(FNR-lfnr)}' file
Prints:
0.314611 3.36278e-10
0.296773 3.17211e-10
0.279535 2.98786e-10
Or if you want the same number of decimals you have, you can use printf:
awk -v m=10 -v OFS="\t" '
FNR%m==1{sum1=0;sum2=0}
{sum1+=$1;sum2+=$2}
FNR%m==0{printf("%0.9G%s%0.9G\n",sum1/m,OFS,sum2/m); lfnr=FNR; next}
END{if(FNR>lfnr) printf("%0.9G%s%0.9G\n",sum1/(FNR-lfnr),OFS,sum2/(FNR-lfnr))}' file
Prints:
0.314611284 3.36278E-10
0.296772974 3.172112E-10
0.279535036 2.98786444E-10
Your superpower here is the % modulo operator, which lets you detect every m-th line -- in this case every 10th. Your x-ray vision is the FNR awk special variable, which is the line number of the file you are reading.
FNR%10 is always less than 10; when it is 0 you are on the 10th line of a block and it is time to print, and when it is 1 you are on the first line of a block and it is time to reset the sums.
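If the modulo logic is hard to picture, here is a minimal sketch on a single generated column (seq 1 25 stands in for your data) showing the same reset/print pattern, including the shorter final block:

seq 1 25 | awk -v m=10 '
FNR%m==1{sum=0}
{sum+=$1}
FNR%m==0{print sum/m; lfnr=FNR; next}
END{if(FNR>lfnr) print sum/(FNR-lfnr)}'

This prints 5.5, 15.5 and 23: two full 10-line blocks followed by the mean of the remaining 5 lines.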

Append value to line in input file based on column number

I have a function append_val_to_line which inserts $append_val into the line in Input.txt and writes it back. I want to insert 45 at column 1, and since its length is two characters, it should occupy column 0 and column 1. But my code below is adding it at column 2 and column 3.
I am not sure why that is. Can someone help me achieve the goal mentioned above?
My working solution is below, but it does not add the 45 at columns 0 and 1. I want a generic solution, since I may insert the value at any column N, and the value should then be added from column N back through columns N-1, N-2, ... depending on the length of append_val.
As you can see in the sample input before and after the call to the append_val_to_line function, the value 45 is added at columns 1 and 2, but I wanted it to start at column 0 and end at column 1, since the value 45 is of length two. My code adds it starting at columns 2 and 3 instead.
A space is also a valid column position, but I will not be adding any values at the spaces in a line.
#! /bin/bash
function append_val_to_line {
    sed -i 's/\(.\{'"$1"'\}\)/\1'"$2"'/' "Input.txt"
}
column_num=1
append_val=45
append_val_to_line "$column_num" "$append_val"
Input.txt BEFORE the call to function append_val_to_line
1200 5600 775000 34555
Input.txt AFTER the call to function append_val_to_line
145200 5600 775000 34555
Note that 45 has been added at columns 2 and 3.
Since the OP wants the character positions to start from 0 (and I believe it is really character positions rather than columns we are talking about here), based on that and the shown samples the following may help.
awk -v after="0" -v value="45" '{print substr($0,1,after+1) value substr($0,after+2)}' Input_file
Non-one-liner form of the above:
awk -v after="0" -v value="45" '
{
print substr($0,1,after+1) value substr($0,after+2)
}
' Input_file
Explanation: a detailed explanation of the above.
awk -v after="0" -v value="45" ' ##Start awk program here, setting the after variable to 0 (as per OP) and value to 45.
{
print substr($0,1,after+1) value substr($0,after+2) ##Print the substring from position 1 to after+1 (OP wants to insert the value before the 2nd character, so this prints the 1st character), then print value, then print the rest of the current line.
}
' Input_file ##Mention the Input_file name here.
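If you would rather stay in bash than shell out to sed or awk, the same character-position insert can be done with parameter expansion. A minimal sketch assuming a 0-based position argument; the function name insert_at_pos is illustrative, not from the question:

#!/bin/bash
# Insert string $2 into every line of Input.txt, starting at 0-based character position $1.
insert_at_pos() {
    local pos=$1 val=$2 line out=""
    while IFS= read -r line; do
        out+="${line:0:pos}${val}${line:pos}"$'\n'
    done < Input.txt
    printf '%s' "$out" > Input.txt
}

insert_at_pos 0 45    # "1200 5600 775000 34555" becomes "451200 5600 775000 34555"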

How to average the values of different files and save them in a new file

I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the last two numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges from 1 to 10. A sample of these files is in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1, 50, 100, ...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result should have a row for each pair of sys and known files, holding the averages of their 2nd columns:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds to the sys and known files with the same last two numbers.
Besides, I would like the first column to hold the penultimate number of the files.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate over all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (though you can easily swap in awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
datamash -W mean 2 < "$systime" >> "$sysmeans"
datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per output column. After populating them with the data from each pair of files, one pair per line of each, it uses paste to combine them and print the result to standard output.
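The answer notes that awk can stand in for datamash; a minimal sketch of the same loop using the question's own averaging one-liner instead (same file-naming assumptions as above):

#!/bin/sh
for systime in sys-time-*.txt
do
    knownratio=$(echo "$systime" | sed -e 's/sys-time/known-ratio/')
    num=$(echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/')
    sysmean=$(awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$systime")
    knownmean=$(awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$knownratio")
    printf '%s %s %s\n' "$num" "$sysmean" "$knownmean"
done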
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" | sort -t\- -k7,7 -k8,8) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); type=(FILENAME ~ /sys-time/ ? "sys" : "known"); a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if(type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, clear any existing average value and save the type as either "sys" or "known".
On every line, calculate the Cumulative Moving Average (a short standalone demonstration of the formula follows this list).
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
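The running-average update in #3 is the standard cumulative moving average: the new mean is (new value + old mean * old count) / new count, so no per-file totals need to be stored. A quick standalone check of that formula with made-up values (the increment is written as a separate statement here purely for readability):

$ printf '%s\n' 2 4 6 8 | gawk '{ c++; a = ($1 + a * (c - 1)) / c; print c, a }'
1 2
2 3
3 4
4 5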

Awk: how to compare two strings in one line

I have a dataset with 20,000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I tried awk's substr() function, but didn't get the expected outcome. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.
Example of dataset:
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
What I changed compared to your sample script:
Move the if statement out of the { ... } block into a filter
Use length($2) and length($4) instead of hardcoding the value 21
The { print $0 } is not needed, as that is the default action for the matched lines
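One thing to watch: the sample data has a header line (Probe 1 Probe 2). It happens to be filtered out here because its 2nd and 4th fields end in different characters, but if you want to skip it explicitly you can add an NR guard (probes.txt is just a placeholder file name):

awk 'NR > 1 && substr($2, length($2), 1) == substr($4, length($4), 1)' probes.txt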

Print rows whose first field appears exactly twice in the file

I have a file like this:
91052011868;Export Equi_Fort Postal;EXPORT;23/02/2015;1;0;0
91052011868;Sof_equi_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052011868;Sof_trav_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052151371;Export Trav_faible temoin;EXPORT;12/02/2015;1;0;0
91052182019;Export Deme_fort temoin;EXPORT;24/02/2015;1;0;0
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
91052262558;Sof_deme_faible_Email_am;EMAIL;26/01/2015;1;0;1
91052265940;Sof_trav_Faible_Email_am_%yyyy%%mm%%dd%;EMAIL;13/02/2015;1;0;0
91052265940;Sof_trav_Faible_Email_Relance_am_%yyyy%%mm%%dd%;EMAIL;17/02/2015;1;0;0
91052265940;Sof_voya_Faible_Email_am_%yyyy%%mm%%dd%;EMAIL;13/02/2015;1;0;0
91052265940;Sof_voya_Faible_Email_Relance_am_%yyyy%%mm%%dd%;EMAIL;16/02/2015;1;0;0
91052531428;Export Trav_faible temoin;EXPORT;11/02/2015;1;0;0
91052547697;Export Deme_Faible Postal;EXPORT;27/02/2015;1;0;0
91052562398;Export Deme_faible temoin;EXPORT;18/02/2015;1;0;0
I want to find all the lines whose first-column value appears more than once but strictly fewer than 3 times:
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
I did the part below but it doesn't work...
sort file | awk 'NR==FNR{a[$1]++;next;}{ if (a[$1] > 0 && a[$1] <1 )print $0;}' file file
Why?
If what you want is to print all those lines whose first field appears twice, you can use this:
$ awk -F";" 'FNR==NR{a[$1]++; next} a[$1]==2' file file
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
This sets the field separator to the semicolon and then reads the file twice:
- the first time to count how many times each 1st field appears (a[$1]++)
- the second time to print those lines matching the condition a[$1]==2, that is, those whose first field appears exactly twice throughout the file.
If you wanted those indexes appearing between 2 and 4 times, you could use the following syntax on the second block:
a[$1]>=2 && a[$1]<=4
Why wasn't your approach working?
Because your condition says:
if (a[$1] > 0 && a[$1] <1 )
which of course can never be true, since a[$1] is an integer and no integer is greater than 0 and smaller than 1.
Note that my proposed solution uses the same idea, only in a slightly more idiomatic way: there is no need for an explicit if, nor for print $0; printing the line is exactly what awk does when a condition evaluates as true.
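If you would rather not read the file twice (for instance when the data comes from a pipe), a single-pass variant can buffer the lines per key and print them at the end. A sketch, with the caveat that for (k in count) does not preserve the original line order:

awk -F";" '{ lines[$1] = lines[$1] $0 ORS; count[$1]++ }
END { for (k in count) if (count[k] == 2) printf "%s", lines[k] }' file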
