I have a file like this:
91052011868;Export Equi_Fort Postal;EXPORT;23/02/2015;1;0;0
91052011868;Sof_equi_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052011868;Sof_trav_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052151371;Export Trav_faible temoin;EXPORT;12/02/2015;1;0;0
91052182019;Export Deme_fort temoin;EXPORT;24/02/2015;1;0;0
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
91052262558;Sof_deme_faible_Email_am;EMAIL;26/01/2015;1;0;1
91052265940;Sof_trav_Faible_Email_am_%yyyy%%mm%%dd%;EMAIL;13/02/2015;1;0;0
91052265940;Sof_trav_Faible_Email_Relance_am_%yyyy%%mm%%dd%;EMAIL;17/02/2015;1;0;0
91052265940;Sof_voya_Faible_Email_am_%yyyy%%mm%%dd%;EMAIL;13/02/2015;1;0;0
91052265940;Sof_voya_Faible_Email_Relance_am_%yyyy%%mm%%dd%;EMAIL;16/02/2015;1;0;0
91052531428;Export Trav_faible temoin;EXPORT;11/02/2015;1;0;0
91052547697;Export Deme_Faible Postal;EXPORT;27/02/2015;1;0;0
91052562398;Export Deme_faible temoin;EXPORT;18/02/2015;1;0;0
I want to know all the lines where the first-column value appears more than once but strictly fewer than three times. Expected output:
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
I tried the command below, but it doesn't work...
sort file | awk 'NR==FNR{a[$1]++;next;}{ if (a[$1] > 0 && a[$1] <1 )print $0;}' file file
Why?
If what you want is to print all those lines whose first field appears twice, you can use this:
$ awk -F";" 'FNR==NR{a[$1]++; next} a[$1]==2' file file
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
This sets the field separator to the semicolon and then reads the file twice:
- the first time to count how many times the 1st field appears (a[$1]++)
- the second time to print the lines matching the condition a[$1]==2, that is, those whose first field appears exactly twice throughout the file.
If you wanted those indexes appearing between 2 and 4 times, you could use the following condition in the second block:
a[$1]>=2 && a[$1]<=4
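For instance, the complete command for that 2-to-4 range would be:
$ awk -F";" 'FNR==NR{a[$1]++; next} a[$1]>=2 && a[$1]<=4' file file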
Why wasn't your approach working?
Because your condition says:
if (a[$1] > 0 && a[$1] < 1)
which can never be true, since a[$1] is an integer count and no integer is greater than 0 and smaller than 1. (Note also that the leading sort file | is ignored: since awk is given two file arguments, it reads from those files, not from standard input.)
My proposed solution uses the same idea, only in a slightly more idiomatic way: there is no need for an explicit if, nor for print $0; printing the current line is exactly what awk does when a condition evaluates to true.
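In other words, the following two commands behave identically:
awk -F";" 'FNR==NR{a[$1]++; next} a[$1]==2' file file
awk -F";" 'FNR==NR{a[$1]++; next} { if (a[$1]==2) print $0 }' file file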
I have a tab-delimited text file with two columns and no header. I want to take the mean of each column within blocks of 10 rows. That is, I take the first 10 rows, compute the mean of those 10 numbers in each column, and write each column's mean to another text file. Then I take the next 10 rows and do the same, until the end of the file. If fewer than 10 rows are left at the end, just take the mean of the remaining rows.
Input file:
0.32832977 3.50941E-10
0.31647876 3.38274E-10
0.31482627 3.36508E-10
0.31447645 3.36134E-10
0.31447645 3.36134E-10
0.31396809 3.35591E-10
0.31281157 3.34354E-10
0.312004 3.33491E-10
0.31102326 3.32443E-10
0.30771822 3.2891E-10
0.30560062 3.26647E-10
0.30413213 3.25077E-10
0.30373717 3.24655E-10
0.29636685 3.16777E-10
0.29622422 3.16625E-10
0.29590765 3.16286E-10
0.2949896 3.15305E-10
0.29414582 3.14403E-10
0.28841901 3.08282E-10
0.28820667 3.08055E-10
0.28291832 3.02403E-10
0.28243792 3.01889E-10
0.28156429 3.00955E-10
0.28043638 2.9975E-10
0.27872239 2.97918E-10
0.27833349 2.97502E-10
0.27825573 2.97419E-10
0.27669023 2.95746E-10
0.27645657 2.95496E-10
Expected output text file:
0.314611284 3.36278E-10
0.296772974 3.172112E-10
0.279535036 2.987864E-10
I tried this code, but I don't know how to include the loop for each 10th row:
awk '{x+=$1;next}END{print x/NR}' file
Here is an awk to do this:
awk -v m=10 -v OFS="\t" '
FNR%m==1{sum1=0;sum2=0}
{sum1+=$1;sum2+=$2}
FNR%m==0{print sum1/m,sum2/m; lfnr=FNR; next}
END{if(FNR>lfnr)print sum1/(FNR-lfnr),sum2/(FNR-lfnr)}' file
Prints:
0.314611 3.36278e-10
0.296773 3.17211e-10
0.279535 2.98786e-10
Or if you want the same number of decimals you have, you can use printf:
awk -v m=10 -v OFS="\t" '
FNR%m==1{sum1=0;sum2=0}
{sum1+=$1;sum2+=$2}
FNR%m==0{printf("%0.9G%s%0.9G\n",sum1/m,OFS,sum2/m); lfnr=FNR; next}
END{if(FNR>lfnr)printf("%0.9G%s%0.9G\n",sum1/(FNR-lfnr),OFS,sum2/(FNR-lfnr))}' file
Prints:
0.314611284 3.36278E-10
0.296772974 3.172112E-10
0.279535036 2.98786444E-10
Your superpower here is the % modulo operator, which lets you detect every mth step -- in this case every 10th. Your x-ray vision is the FNR awk special variable, which is the number of the line you are reading within the current file.
FNR%10 is always less than 10: when it is 0 you are on a 10th line and it is time to print, and when it is 1 you are on the first line of a block and it is time to reset the sums.
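A quick way to see the cycle is a throwaway BEGIN loop (purely illustrative):
$ awk -v m=10 'BEGIN{for(i=8;i<=12;i++) print i, i%m}'
8 8
9 9
10 0
11 1
12 2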
I have a function append_val_to_line which inserts $append_val into the line in Input.txt and writes it back. I want to insert 45 starting at column 0; since it is two characters long, it should occupy columns 0 and 1. But my code below adds it at columns 1 and 2 instead.
I am not sure why. Can someone help me achieve the goal mentioned above?
My attempt is below, but it does not add the 45 at columns 0 and 1. I want a generic solution, as I may insert the value at any column: the value should start at column N and run through N+1, N+2, ... depending on the length of append_val.
As you can see in the sample input before and after the call to append_val_to_line, the value 45 starts at column 1 and ends at column 2, but I wanted it to start at column 0 and end at column 1, as the value 45 has length two.
A space is also a valid column, but I will not be inserting any values into the spaces of a line.
#! /bin/bash
function append_val_to_line {
    sed -i 's/^\(.\{'"$1"'\}\)/\1'"$2"'/' "input.txt"
}
column_num=1
append_val=45
append_val_to_line "$column_num" "$append_val"
Input.txt BEFORE the call to function append_val_to_line
1200 5600 775000 34555
Input.txt AFTER the call to function append_val_to_line
145200 5600 775000 34555
Note 45 has been added at columns 1 and 2 (counting from 0), not at columns 0 and 1.
Since OP wants positions counted from 0 (and I believe we are really talking about character positions rather than columns here), based on that and the shown samples the following may help.
awk -v after="0" -v value="45" '{print substr($0,1,after+1) value substr($0,after+2)}' Input_file
A non-one-liner form of the above:
awk -v after="0" -v value="45" '
{
print substr($0,1,after+1) value substr($0,after+2)
}
' Input_file
Explanation: adding a detailed explanation for the above.
awk -v after="0" -v value="45" '   ##Start the awk program here, setting the after variable to 0 as per OP and value to 45.
{
print substr($0,1,after+1) value substr($0,after+2)   ##Print the sub-string from position 1 through after+1 -- since value should start at the 2nd character, this prints the 1st character. Then print value, then the rest of the current line.
}
' Input_file   ##Mention the Input_file name here.
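For a generic version in the same sed style as the question, here is a minimal sketch, assuming character positions are counted from 0 (the name insert_val_at and the hard-coded input.txt are mine):
#! /bin/bash
# Insert $2 so that it starts at 0-based character position $1 of each line.
insert_val_at() {
    sed -i 's/^\(.\{'"$1"'\}\)/\1'"$2"'/' "input.txt"
}

insert_val_at 0 45   # "1200 5600 ..." becomes "451200 5600 ..."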
I have a file ABC.txt which contains two columns. The first column is a count and the second column is a subscriber number, as below:
1852 919474214491
1558 919475591746
1149 919475594574
1 919466423350
I have a variable in a script holding some numeric value, i.e. Count is 3500.
I want to compare the variable with the first column of the ABC.txt file. If the value in the first column is less than the variable, move the value in the second column to a separate file (123.txt). Go to the next row: now add 1852 to 1558 and compare the sum with the variable again; if it is still less than the variable, move that row's second-column value to 123.txt too. But once the running sum of counts exceeds the variable, stop.
Really easy to do with awk:
$ awk -v count=3500 '{ total += $1 } total >= count { exit } { print $2 }' ABC.txt
919474214491
919475591746
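To actually collect those numbers in 123.txt and take the threshold from your script's variable (assuming it is called Count), redirect the output and pass the variable in with -v:
$ awk -v count="$Count" '{ total += $1 } total >= count { exit } { print $2 }' ABC.txt > 123.txt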
I have a dataset with 20,000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe1 column matches the last nucleotide in the Probe2 column. So far I have tried awk's substr function, but didn't get the expected outcome. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to anchor the last character in columns 2 and 4 (awk '$2 ~ /[A-Z]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.
Example of dataset:
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
What I changed compared to your sample script:
- moved the if statement out of the { ... } action block and into a pattern (filter)
- used length($2) and length($4) instead of hardcoding the value 21
- dropped { print $0 }, since printing the line is the default action for matched lines
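As a side note, substr without a third argument runs to the end of the string, so the comparison can be shortened a little further (the file name here is illustrative):
awk 'substr($2, length($2)) == substr($4, length($4))' probes.txt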
I have many files with the same structure: two columns of numbers. I want to add up the second-column values of each line across all of my files, so that I end up with only one file. Can anyone help? I hope the question is clear enough. Thanks.
The following is based on the information OP provided in the comments above:
- we have multiple files and we have to sum the second column of each of them; as far as we know there could be hundreds or thousands of different files
- the first column of each file does not seem important, and I'm going to assume (based on OP's sample data) that every input file has the same first column
The basic idea is to start with an empty summary file tot, paste each file in turn next to tot, and sum columns 2 and 4 (if present) into the second column of a new tot file.
In other words...
$ touch tot ; for f in * ; do paste tot "$f" | awk '{ if ( NF > 3 ) { print $1, $2+$4 } else { print $1, $2 } }' > tmp ; mv tmp tot ; done
I tested it with 8 different files and it seems to work as expected.
Of course for f in * has to be changed in order to capture ALL and ONLY the files we want to sum; as written, the glob would also pick up tot and tmp themselves.
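Alternatively, here is a sketch that avoids the temporary files by letting awk accumulate everything, again assuming every file shares the same first column (the file names are illustrative):
awk '{ key[FNR] = $1; sum[FNR] += $2; if (FNR > max) max = FNR }
     END { for (i = 1; i <= max; i++) print key[i], sum[i] }' file1 file2 file3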
Assuming what you want is the sum of all values of the second column of each file, it looks like a simple enough job for awk:
cat files | awk '{ sum += $2 } END { print sum }'
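Note that the cat is not strictly necessary, since awk accepts multiple file names directly (the names are illustrative):
awk '{ sum += $2 } END { print sum }' file1 file2 file3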