Delete matching lines in two tab-delimited files - bash

I have two tab-delimited files
A 2
A 5
B 4
B 5
C 10
and
A 2
A 5
B 5
I want to delete the lines in file1 that are in file2 so that the output is:
B 4
C 10
I have tried:
awk 'NR==FNR{c[$1$2]++;next};!c[$1$2] > 0' file2 file1 > file3
but it deletes more lines than expected. The line counts (file1, file2 and the output file3) are:
1026997259 file1
1787919 file2
1023608359 file3
How can I modify this code so that it also handles the following case?
I have two tab-delimited files
A 2 3
A 5 4
B 4 5
B 5 5
C 10 12
and
A 2 5
A 5 4
B 5 3
F 6 7
Based only on the 1st and 2nd columns, I want to delete the lines in file1 that are also in file2, so that the output is:
B 4 5
C 10 12

Why not use the grep command?
grep -vf file2 file1
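One caveat the answer leaves implicit: grep -f treats every line of file2 as a regular expression and matches it anywhere in a line. For the original two-column files, where matching lines are identical, fixed-string whole-line matching is safer (my suggestion, not part of the answer):
grep -Fvxf file2 file1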

Think about it: if you concatenate ab c and a bc, they both become abc, so what do you think your code is doing with $1$2? Use SUBSEP as intended ($1,$2) and change !c[$1$2] > 0 to !(($1,$2) in c). Also consider whether !c[$1$2] > 0 means !(c[$1$2] > 0) or (!c[$1$2]) > 0; I'd never write the former, so I don't know for sure, which is why I always write it with parens so it's parsed the way I intend. So do:
awk 'NR==FNR{c[$1,$2];next} !(($1,$2) in c)' file2 file1
Or just use $0 instead of $1,$2:
awk 'NR==FNR{c[$0];next} !($0 in c)' file2 file1
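For the second (three-column) example above, matching only on the first two fields, a quick dry run suggests this keeps exactly the non-matching lines:
awk 'NR==FNR{c[$1,$2];next} !(($1,$2) in c)' file2 file1
B 4 5
C 10 12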

If the matching lines in the two files are identical, and the two files are sorted in the same order, then comm(1) can do the trick:
comm -23 file1 file2
It prints lines that are only in the first file (suppressed by -1), lines that are only in the second file (suppressed by -2), and lines that are in both files (suppressed by -3). If you leave more than one of those columns enabled, they are printed as separate (tab-indented) columns.
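If the files are not already sorted, you can sort them on the fly with process substitution (a bash sketch, not part of the original answer):
comm -23 <(sort file1) <(sort file2)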

Related

Comparing the column of one file with the row of another file

I have data in 314 files (named like file1, file2, file3, ...). Each file has two columns and a different number of rows.
Example Input file1
a 19
b 9
c 8
i 7
g 6
d 5
Example Input file2
a 19
i 7
g 6
d 5
I have another file (data.txt) with 314 rows, where each row has a different number of columns:
a d c g
a i
a d
d c
I want to compare column 1 of file1 with the 1st row of the data.txt file, similarly column 1 of file2 with the 2nd row of data.txt, and so on, up to column 1 of file314 with the 314th row of the data.txt file.
My expected output is the number of genes matched and mismatched for each file and its corresponding row.
I am able to do it only file by file. How can I do it in a single command?
Expected output
Matched Mismatched
1st_file_1st row 4 2
2nd_file_2nd row 2 2
.
.
314th_file_314th row - -
The easiest way is the following:
awk '(FNR==NR){$1=$1; a[FNR]=OFS $0 OFS; next}   # data.txt: normalize separators, store each row OFS-padded
     f && (FNR==1) { print f,m,nr-m }            # starting a new file: report the previous file first
     (FNR==1){f++; nr=m=0}                       # bump the file counter, reset line and match counts
     {nr++; if(a[f] ~ OFS $1 OFS) m++ }          # count a match if field 1 occurs as a whole field in row f
     END { print f,m,nr-m }' data.txt f1.txt f2.txt ... f314.txt
For the data.txt and f1.txt and f2.txt mentioned in the OP, the following output is produced:
1 4 2
2 2 2
The first column represents the file number/row, the second column represents the total matches and the third the total mismatches.
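To see why the rows are stored OFS-padded, here is a throwaway illustration (not part of the answer's command): the padding makes a gene like a match only as a whole field, never as a substring of ab.
$ echo 'ab c' | awk '{ row = OFS $0 OFS; if (row ~ OFS "a" OFS) print "match"; else print "no match" }'
no match
$ echo 'a c' | awk '{ row = OFS $0 OFS; if (row ~ OFS "a" OFS) print "match"; else print "no match" }'
match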

Optimizing grep -f piping commands [duplicate]

This question already has answers here: Inner join on two text files (5 answers). Closed 4 years ago.
I have two files.
file1 has some keys that have abc in the second column:
et1 abc
et2 abc
et55 abc
file2 has those keys in its last column, along with some other numbers I need to add up:
1 2 3 4 5 et1
5 5 5 5 5 et100
3 3 3 3 3 et55
5 5 5 5 4 et1
6 6 6 6 3 et1
For the keys extracted from file1, I need to add up the corresponding column 5 values when the key matches. file2 itself is very large.
This command seems to be working but it is very slow:
egrep -isr "abc" file1.tcl | awk '{print $1}' | grep -vwf /dev/stdin file2.tcl | awk '{tl+=$5} END {print tl}'
How would I go about optimizing the pipe? Also, what am I doing wrong with grep -f? Is it generally not recommended to do something like this?
Edit: The expected output is the sum of column 5 in file2 for every row whose column 6 key is present in file1.
Edit 2: Since file1 has the keys et1, et2 and et55, adding up column 5 of file2 for the matching rows 1, 3, 4 and 5 gives the expected output [5+3+4+3=15].
Use a single awk to read file1 into the keys of an array. Then when reading file2, add $5 to a total variable when $6 is in the array.
awk 'NR==FNR { if ($2 == "abc") a[$1] = 0; next }   # file1: remember keys whose second column is abc
     $6 in a { total += $5 }                        # file2: add column 5 whenever the key in column 6 is known
     END { print total }' file1.tcl file2.tcl
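With the sample file1.tcl and file2.tcl above, rows 1, 3, 4 and 5 of file2 match, so this should print the grand total from Edit 2 (a quick dry run; sum.awk is just a hypothetical name for the script above saved to a file):
$ awk -f sum.awk file1.tcl file2.tcl
15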
Could you please try the following, which reads file2.tcl first and uses fewer loops. Since your expected output was not clear, I haven't completely tested it.
awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}' file2.tcl file1.tcl
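Note that for the sample files this prints one line per key from file1.tcl rather than a single grand total; a dry run suggests something like:
et1 12
et2 0
et55 3
Summing the second column gives the same 15 as in Edit 2.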

Cross-referencing strings from two files by line number and collecting them into a third file

I have two files that I wish to coordinate into a single file for plotting an xy-graph.
File1 contains a different x-value on each line, followed by a series of y-values on the same line. File2 contains the specific y-value that I need from File1 for each point x.
In reality, I have 50,000 lines and 50-100 columns, but here is a simplified example.
File1 appears like this:
1 15 2 3 1
2 18 4 6 5
3 19 7 8 9
4 23 10 2 11
5 25 18 17 16
column 1 is the line number.
column 2 is my x-value, sorted in ascending order.
columns 3-5 are my y-values. They aren't unique; a y on one line could match a y on a different line.
File2 appears like this:
3
5
2
18
The y on each line in File2 corresponds to a number matching one of the y's in File1 from the same line (for the first few hundred lines). After the first few hundred lines, they may not always have a match. Therefore, File2 has fewer lines than File1. I would like to either ignore these rows or fill them with a 0.
Goal
The output, File3, should consist of:
15 3
18 5
19 0
23 2
25 18
or the line with
19 0
removed, whichever works for the script. If neither option is possible, then I would also be okay with just matching the y-values line by line until there is no match, and then stopping there.
Attempts
I initially routed File2 into an array:
a=( $(grep -e '14,12|:*' File0 | cut -b 9-17) )
but then I noticed similar questions (1, 2) on Stackexchange used a second file, hence I routed the above grep command into File2.
These questions are slightly different, since I require specific columns from File1, but I thought I could at least use them as a starting point. The solutions to these questions:
1)
grep -Fwf File2 File1
of course reproduces the entire line from File1, and I'm not sure how to proceed from there; or
2)
awk 'FNR==NR {arr[$1];next} $1 in arr' File2 File1
fails entirely for me, with no error message except the general awk help response.
Is this possible to do? Thank you.
awk 'NR==FNR { arr[NR] = $1; next }    # file2: remember the wanted y-value for each line number
{
    for (i = 3; i <= NF; ++i) {        # file1: scan the y columns
        if ($i == arr[n]) {            # next pending file2 value found on this line
            print $2, $i
            n++
            next
        }
    }
    print $2, 0                        # no match on this line: pad with 0
}' n=1 file2 file1
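With File1 and File2 from the question, a dry run of the script above should reproduce the desired File3, including the 19 0 placeholder line:
15 3
18 5
19 0
23 2
25 18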
Another awk; this one prints the first match only:
$ awk 'NR==FNR {a[$1]; next}
{f2=$2; $1=$2="";
for(k in a) if($0 FS ~ FS k FS) {print f2,k; next}}' file2 file1
15 2
18 5
23 2
25 18
The FS padding eliminates substring matches. Note the order of the files: file2 should be provided first.

Count patterns in a CSV file from another CSV file in bash

I have two csv files
File A
ID
1
2
3
File B
ID
1
1
1
1
3
2
3
What I want to do is count how many times an ID in File A shows up in File B, and save the result in a new File C (also in CSV format). For example, 1 in File A shows up 4 times in File B. So in the new File C, I should have something like:
File C
ID,Count
1,4
2,1
3,2
Originally I was thinking of using "grep -f", but it seems like it only works with .txt files. Unfortunately, File A and File B are both in CSV format. So now I am thinking maybe I could use a for loop to get each ID from File A individually and use grep -c to count it. Any idea will be helpful.
Thanks in advance!
You can use this awk command:
awk -v OFS=, 'FNR==1{next} FNR==NR{a[$1]; next} $1 in a{freq[$1]++}
END{print "ID", "Count"; for (i in freq) print i, freq[i]}' fileA fileB
ID,Count
1,4
2,1
3,2
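One thing to keep in mind: for (i in freq) visits the array in an unspecified order, so the sorted output above is not guaranteed. If you have GNU awk, one option (my addition, not part of the original answer) is to request numerically sorted traversal:
awk -v OFS=, 'FNR==1{next} FNR==NR{a[$1]; next} $1 in a{freq[$1]++}
END{PROCINFO["sorted_in"]="@ind_num_asc"; print "ID", "Count"; for (i in freq) print i, freq[i]}' fileA fileB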
You could use join, sort, uniq and process substitution <(command) creatively:
$ join -2 2 <(sort A) <(sort B | uniq -c) | sort -n > C
$ cat C
ID 1
1 4
2 1
3 2
And if you really really want the header to be ID Count, before writing to file C you could replace that 1 with Count with sed by adding:
... | sed 's/\(ID \)1/\1Count/' > C
to get
ID Count
1 4
2 1
3 2
And if you really really want commas as separators instead of spaces, replace the spaces with commas using tr by also adding:
... | tr \ , > C
to get
ID,Count
1,4
2,1
3,2
You could of course ditch the tr and use sed like this instead:
... | sed 's/\(ID \)1/\1Count/;s/ /,/' > C
And the output would be like above.
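For reference, assembling all of the pieces above into one pipeline (my own combination of the steps described, assuming the same files A and B):
$ join -2 2 <(sort A) <(sort B | uniq -c) | sort -n | sed 's/\(ID \)1/\1Count/;s/ /,/' > C
$ cat C
ID,Count
1,4
2,1
3,2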

Count how many occurrences in a line are greater than or equal to a defined value

I've a file (F1) with N=10000 lines, where each line contains M=20000 numbers. I've another file (F2) with N=10000 lines and only 1 column. How can I count the number of values on line i of file F1 that are greater than or equal to the number found on line i of file F2? I tried using a bash loop with awk/sed but my output is empty.
Edit >
For now I've only succeeded in printing the number of values that are higher than a fixed value. Here is an example with a 3-line file and a fixed value of 15 (sorry, it's very dirty code):
for i in {1..3};do sed -n "$i"p tmp.txt | sed 's/\t/\n/g' | awk '{if($1 > 15){print $1}}' | wc -l; done;
Thanks in advance,
awk 'FNR==NR { a[FNR]=$1; next }      # file2: store each threshold, indexed by line number
     {
       count=0
       for (i=1; i<=NF; i++)          # file1: check every field on the current line
         if ($i >= a[FNR]) count++
       print count
     }' file2 file1
While processing file2, FNR equals NR, so store each value in array a with the current record number as index.
For each line of file1, initialize count to 0.
Loop through the fields, incrementing the counter if a value is greater than or equal to the entry at the current FNR index of array a.
Print the count value.
$ cat file1
1 3 5 7 3 6
2 5 6 8 7 7
4 6 7 8 9 4
$ cat file2
6
3
1
$ awk -f file.awk file2 file1
2
5
6
You could do it in a single awk command:
awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
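For the three-line sample above, a quick dry run (with the >= comparison) should print the same counts as before:
$ awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
2
5
6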
