Compare names and numbers in two files and output more - bash

For example, there are 2 files:
$ cat file1.txt
e 16
a 9
c 14
b 9
f 25
g 7
$ cat file2.txt
a 10
b 12
c 15
e 8
g 7
Comparing these two files with the command grep -xvFf "$dir2" "$dir1" | tee "$dir3" (where $dir1, $dir2 and $dir3 hold the paths to file1, file2 and file3 respectively), we get the following output in file3:
$ cat file3.txt
e 16
a 9
c 14
b 9
f 25
Now I need to compare file3 against file2 and keep in file3 only those lines where the number next to the letter has become greater; if the number is equal to or less than the value in file2, that line should not be written to file3. That is, the contents of file3 should be:
$ cat file3.txt
e 16
f 25

{m,g}awk 'FNR < NR ? __[$!_]<+$NF : (__[$!_]=+$NF)<_' file2.txt file1.txt
e 16
f 25
If you really want to clump it all into one shot:
mawk '(__[$!!(_=NF)]+= $_ * (NR==FNR)) < +$_' file2.txt file1.txt

One awk idea:
awk '
FNR==NR { a[$1]=$2; next } # 1st file: save line in array a[]
($1 in a) && ($2 > a[$1]) # 2nd file: print current line if 1st field is an index in array a[] *AND* 2nd field is greater than the corresponding value from array a[]
!($1 in a) # 2nd file: print current line if 1st field is not an index in array a[]
' file2.txt file1.txt
This generates:
e 16
f 25
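If the filtered result should end up back in file3.txt, as in the original grep pipeline, the same logic can be folded into one condition and the output redirected (a minimal sketch reusing the answer above; adjust the path to wherever $dir3 points):
awk 'FNR==NR { a[$1]=$2; next } !($1 in a) || ($2 > a[$1])' file2.txt file1.txt > file3.txt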

Related

Comparing the column of one file with the row of another file

I have data in 314 files (named file1, file2, file3, ...). Each file has two columns and a different number of rows.
Example Input file1
a 19
b 9
c 8
i 7
g 6
d 5
Example Input file2
a 19
i 7
g 6
d 5
I have another file (data.txt) with 314 rows, and each row has a different number of columns:
a d c g
a i
a d
d c
I want to compare column 1 of file1 with the 1st row of data.txt, similarly column 1 of file2 with the 2nd row of data.txt, and so on, up to column 1 of file314 with the 314th row of data.txt.
My expected output is the number of genes matched and mismatched for each file and its corresponding row.
I am able to do it only with separate files one at a time. How can I do it in a single command?
Expected output
                      Matched  Mismatched
1st_file_1st_row      4        2
2nd_file_2nd_row      2        2
.
.
314th_file_314th_row  -        -
The easiest way is the following:
awk '(FNR==NR){$1=$1; a[FNR]=OFS $0 OFS; next}
f && (FNR==1) { print f,m,nr-m }
(FNR==1){f++; nr=m=0}
{nr++; if(a[f] ~ OFS $1 OFS) m++ }
END { print f,m,nr-m }' data.txt f1.txt f2.txt ... f314.txt
For the data.txt and f1.txt and f2.txt mentioned in the OP, the following output is produced:
1 4 2
2 2 2
The first column represents the file number/row, the second column represents the total matches and the third the total mismatches.
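If the data files really are named file1 through file314 as described, bash brace expansion spares you from listing them all by hand. Assuming the program above has been saved in a file called compare.awk (a name chosen here purely for illustration):
awk -f compare.awk data.txt file{1..314}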

Adding the last number in each file to the numbers in the following file

I have some directories, each of which contains a file with a list of integers 1-N; the integers are not necessarily consecutive and the lists may be different lengths. What I want to achieve is a single file with a list of all those integers as though they had been generated in one list.
What I am trying to do is to add the final value N from file 1 to all the values in file 2, then take the new final value of file 2 and add it to all the values in file 3 etc.
I have tried this by setting a counter and looping over the files, resetting the counter when I get to the end of each file. The problem is that p=0 keeps getting reset, which is obvious from the code, but I am not sure how else to do it.
What I tried:
p=0
for i in dirx/dir_*; do
  (cd "$i" || exit;
   awk -v p=$p 'NR>1{print last+p} {last=$0} END{$0=last; p=last; print}' file >> /someplace/bigfile)
done
This is similar to the answer suggested in this question: Replacing value in column with another value in txt file using awk.
Now I'm wondering whether I need an if/else: if it's the first dir then p=0, otherwise p is the last value from the previous file, though I'm not sure about that or how to get it to take the last value. I used awk because that's what I understand a small amount of and would usually use.
With GNU awk
gawk '{print $1 + last} ENDFILE {last = last + $1}' file ...
Demo:
$ cat a
1
2
4
6
8
$ cat b
2
3
5
7
$ cat c
1
2
3
$ gawk '{print $1 + last} ENDFILE {last = last + $1}' a b c
1
2
4
6
8
10
11
13
15
16
17
18
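ENDFILE is specific to GNU awk; with a POSIX awk (mawk, BSD awk) the same effect can be had by adding the running offset whenever a new file starts. A sketch, checked only against the demo files a, b and c above:
awk 'FNR==1 { offset += last } { print $1 + offset; last = $1 }' a b c
With the original directory layout, the file list can also come straight from the glob, e.g. dirx/dir_*/file, with the output redirected to the single big file.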

awk to remove duplicate rows totally based on a particular column value

I got a dataset like:
6 AA_A_56_30018678_E 0 30018678 P A
6 SNP_A_30018678 0 30018678 A G
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
6 AA_A_62_30018696_Q 0 30018696 P A
6 AA_A_62_30018696_G 0 30018696 P A
6 AA_A_62_30018696_R 0 30018696 P A
I want to remove all rows whose column 4 value is duplicated.
I have used the commands below (sort, awk, uniq and join) to get the required output; is there a better way to do this?
sort -k4,4 example.txt | awk '{print $4}' | uniq -u > snp_sort.txt
join -1 1 -2 4 snp_sort.txt example.txt | awk '{print $3,$5,$6,$1}' > uniq.txt
Here is the output
SNP_A_30018679 T G 30018679
SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695
Using awk to filter out duplicate lines and print only those whose key occurs exactly once:
awk '{k=($2 FS $5 FS $6 FS $4)} {a[$4]++;b[$4]=k}END{for(x in a)if(a[x]==1)print b[x]}' input_file
SNP_A_30018682 T G 30018682
SNP_A_30018695 G C 30018695
SNP_A_30018679 T G 30018679
The idea is to:
Count each $4 value in array a, and store the reformatted line for that key in array b.
At the end, print the stored lines for keys that occur exactly once.
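Spelled out with comments, the same program looks like this (identical logic, just split across lines; the field meanings are taken from the sample data above):
awk '{ k = $2 FS $5 FS $6 FS $4 }   # build the output line: id, alleles, position
     { a[$4]++; b[$4] = k }         # a[] counts each $4 value, b[] keeps its line
     END { for (x in a)
             if (a[x] == 1)         # keep positions that occurred exactly once
               print b[x] }' input_file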
Using command substitution: first extract the fourth-field values that occur only once, then grep for those values.
grep "$(echo "$(awk '{print $4}' inputfile.txt)" |sort |uniq -u)" inputfile.txt
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
Note: append awk '{NF=4}1' to the command if you wish to print only the first four columns. Of course you can change which columns are used by adjusting $4 and NF=4.
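For example, appending that step to the pipeline above (the inner echo is not needed and is dropped here) trims the result to the first four columns:
grep "$(awk '{print $4}' inputfile.txt | sort | uniq -u)" inputfile.txt | awk '{NF=4}1'
6 SNP_A_30018679 0 30018679
6 SNP_A_30018682 0 30018682
6 SNP_A_30018695 0 30018695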
$ awk 'NR==FNR{c[$4]++;next} c[$4]<2' file file
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
Since your 'key' is fixed width, uniq's -w option can compare just that part of the line:
sort -k4,4 example.txt | uniq -u -f 3 -w 8 > uniq.txt
Another in awk:
$ awk '{$1=$1; a[$4]=a[$4] $0} END{for(i in a) if(gsub(FS,FS,a[i])==5) print a[i]}' file
6 SNP_A_30018679 0 30018679 T G
6 SNP_A_30018682 0 30018682 T G
6 SNP_A_30018695 0 30018695 G C
Concatenate lines into the array using $4 as the key. If an entry ends up with more than 5 field separators, duplicates were concatenated onto it and it will not be printed.
Yet another version in awk. It expects the file to be sorted on the fourth field. It doesn't store all lines in memory, only the keys (even that could probably be avoided since the key field is sorted; it may be fixed later), and runs in one pass:
$ cat ananother.awk
++seen[p[4]]==1 && NR>1 && p[4]!=$4 {   # seen count must be 1 and
    print prev                          # this and previous $4 must differ
    delete seen                         # is this enough really?
}
{
    q=p[4]                              # previous previous $4 for END
    prev=$0                             # previous is stored for printing
    split($0,p)                         # to get previous $4
}
END {                                   # last record control
    if (++seen[$4]==1 && q!=$4)
        print $0
}
Run:
$ sort -k4,4 file | awk -f ananother.awk
A simpler approach, if keeping one copy of each duplicated key (rather than dropping all of them) is acceptable, and noting the sample data is whitespace-separated rather than CSV:
awk '{print $2, $5, $6, $4}' file | sort -u -k4,4 > uniq.txt
This does not match the expected output above exactly, since it keeps the first line for each duplicated position instead of removing them all.

Extract column after pattern from file

I have a sample file which looks like this:
5 6 7 8
55 66 77 88
A B C D
1 2 3 4
2 4 6 8
3 8 12 16
E F G H
11 22 33 44
and so on...
I would like to enter a command in a bash script or just in a bash terminal to extract one of the columns independently of the others. For instance, I would like to do something like a grep/awk command with the pattern=C and get the following output:
C
3
6
12
How can I extract a specific column independently of the others, and also specify a number of lines to extract after the pattern, so that I don't get the column with the 7s or the G column in my output?
If it's always 3 records after the found term:
awk '{for(i=1;i<=NF;i++) {if($i=="C") col=i}} col>0 && rcount<=3 {print $col; rcount++}' test
This looks at each field of the record and, if it finds "C", captures that column number in col. Once col is greater than 0 it prints the contents of that column; it prints the header match plus up to three following records (while rcount <= 3) and then stops.
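The same one-liner, split across lines with comments for readability (functionally identical; test is the sample file name used above):
awk '
{ for (i=1; i<=NF; i++)        # scan every field of the current record
    if ($i == "C") col = i     # remember which column holds the header C
}
col > 0 && rcount <= 3 {       # once the column is known, print it for the
  print $col                   # header line plus the next three records
  rcount++
}' test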
$ cat tst.awk
!prevNF { delete f; for (i=1; i<=NF; i++) f[$i] = i }
NF && (tgt in f) { print $(f[tgt]) }
{ prevNF = NF }
$ awk -v tgt=C -f tst.awk file
C
3
6
12
$ awk -v tgt=F -f tst.awk file
F
22

Count how many occurrences in a line are greater than or equal to a defined value

I have a file (F1) with N=10000 lines, each line containing M=20000 numbers. I have another file (F2) with N=10000 lines and only 1 column. How can I count the numbers in line i of file F1 that are greater than or equal to the number found at line i of file F2? I tried using a bash loop with awk/sed but my output is empty.
Edit:
For now I've only managed to print the number of occurrences that are higher than a fixed value. Here is an example with a 3-line file and a threshold of 15 (sorry, it's very dirty code):
for i in {1..3};do sed -n "$i"p tmp.txt | sed 's/\t/\n/g' | awk '{if($1 > 15){print $1}}' | wc -l; done;
Thanks in advance,
awk 'FNR==NR { a[FNR]=$1; next }
     {
       count=0
       for (i=1; i<=NF; i++)
         if ($i >= a[FNR])
           count++
       print count
     }' file2 file1
While processing file2 (the first file, where FNR equals NR), store each value in array a indexed by the line number.
For every line of file1, initialize count to 0.
Loop through the fields, incrementing the counter whenever a value is greater than or equal to a[FNR], the threshold for the current line.
Print the count.
$ cat file1
1 3 5 7 3 6
2 5 6 8 7 7
4 6 7 8 9 4
$ cat file2
6
3
1
$ awk -f file.awk file2 file1
2
5
6
You could do it in a single awk command:
awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
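With the real 10000-line files it can help to keep the line number next to each count for spot-checking; the same program only needs an extra field in the print (a variation, not something the question asked for):
awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print FNR, c}' file2 file1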
