Comparing the column of one file with the row of another file - shell

I have a data of 314 files (names like file1 fil2 file3 ......). Each file have two columns and different rows.
Example Input file1
a 19
b 9
c 8
i 7
g 6
d 5
Example Input file2
a 19
i 7
g 6
d 5
I have an another file (data.txt) having 314 rows and each row have different number of columns
a d c g
a i
a d
d c
I want to compare Column 1 of file1 with the 1st row of data.txt file and simlarly Column 1 of file2 with the 2nd row of data.txt. and so on till column 1 of file314 with 314th row of the data.txt file.
My expected output is numbers of genes matched and mismatched for particular file and for particular row.
I am able to do it only with separate-separate files. How to do it i single command.
Expected output
Matched Mismatched
Ist_file_1st row 4 2
2nd_file_2nd row 2 2
.
.
314_file_314th row - -

The easiest way is the following:
awk '(FNR==NR){$1=$1; a[FNR]=OFS $0 OFS; next}
f && (FNR==1) { print f,m,nr-m }
(FNR==1){f++; nr=m=0}
{nr++; if(a[f] ~ OFS $1 OFS) m++ }
END { print f,m,nr-m }' data.txt f1.txt f2.txt ... f314.txt
For the data.txt and f1.txt and f2.txt mentioned in the OP, the following output is produced:
1 4 2
2 2 2
The first column represents the file number/row, the second column represents the total matches and the third the total mismatches.

Related

Compare names and numbers in two files and output more

For example, there are 2 files:
$ cat file1.txt
e 16
a 9
c 14
b 9
f 25
g 7
$ cat file2.txt
a 10
b 12
c 15
e 8
g 7
Сomparing these two files with the command(directory dir 1 contains file 1, in directory 2 respectively file 2) grep -xvFf "$dir2" "$dir1" | tee "$dir3" we get the following output in dir 3:
$ cat file3.txt
e 16
a 9
c 14
b 9
f 25
Now I need to essentially compare the output of file 3 and file 2 and output to file 3 only those results where the number next to the letter has become greater, if the number is equal to or less than the value in file 2, do not output these values to the 3rd file, that is the contents of file 3 should be like this:
$ cat file3.txt
e 16
f 25
{m,g}awk 'FNR < NR ? __[$!_]<+$NF : (__[$!_]=+$NF)<_' f2.txt f1.txt
e 16
f 25
if u really wanna clump it all in one shot :
mawk '(__[$!!(_=NF)]+= $_ * (NR==FNR)) < +$_' f2.txt f1.txt
One awk idea:
awk '
FNR==NR { a[$1]=$2; next } # 1st file: save line in array a[]
($1 in a) && ($2 > a[$1]) # 2nd file: print current line if 1st field is an index in array a[] *AND* 2nd field is greater than the corrsponding value from array a[]
!($1 in a) # 2nd file: print current line if 1st field is not an index in array a[]
' file2.txt file1.txt
This generates:
e 16
f 25

Cross-referencing strings from two files by line number and collecting them into a third file

I have two files that I wish to coordinate into a single file for plotting an xy-graph.
File1 contains a different x-value on each line, followed by a series of y-values on the same line. File2 contains the specific y-value that I need from File1 for each point x.
In reality, I have 50,000 lines and 50-100 columns, but here is a simplified example.
File1 appears like this:
1 15 2 3 1
2 18 4 6 5
3 19 7 8 9
4 23 10 2 11
5 25 18 17 16
column 1 is the line number.
column 2 is my x-value, sorted in ascending order.
columns 3-5 are my y-values. They aren't unique; a y on one line could match a y on a different line.
File2 appears like this:
3
5
2
18
The y on each line in File2 corresponds to a number matching one of the y's in File1 from the same line (for the first few hundred lines). After the first few hundred lines, they may not always have a match. Therefore, File2 has fewer lines than File1. I would like to either ignore these rows or fill it with a 0.
Goal
The output, File3, should consist of:
15 3
18 5
19 0
23 2
25 18
or the line with
19 0
removed, whichever works for the script. If neither option is possible, then I would also be okay with just matching the y-values line-by-line until there is not a match, and then stopping there.
Attempts
I initially routed File2 into an array:
a=( $(grep -e '14,12|:*' File0 | cut -b 9-17) )
but then I noticed similar questions (1, 2) on Stackexchange used a second file, hence I routed the above grep command into File2.
These questions are slightly different, since I require specific columns from File1, but I thought I could at least use them as a starting point. The solutions to these questions:
1)
grep -Fwf File2 File1
reproduces of course the entire line in File1, and I'm not sure how to proceed from there; or
2)
awk 'FNR==NR {arr[$1];next} $1 in arr' File2 File1
fails entirely for me, with no error message except the general awk help response.
Is this possible to do? Thank you.
awk 'NR==FNR { arr[NR] = $1; next } {
for (i = 3; i <= NF; ++i) {
if ($i == arr[n]) {
print $2, $i
n++
next
}
}
print $2, 0
}' n=1 file2 file1
another awk, will print the first match only
$ awk 'NR==FNR {a[$1]; next}
{f2=$2; $1=$2="";
for(k in a) if($0 FS ~ FS k FS) {print f2,k; next}}' file2 file1
15 2
18 5
23 2
25 18
padded FS to eliminate sub-string matches. Note the order of the files, file2 should be provided first.

Delete matching lines in two tab delimited files

I have 2 tab delimited files
A 2
A 5
B 4
B 5
C 10
and
A 2
A 5
B 5
I want to delete the lines in file1 that are in file2 so that the output is:
B 4
C 10
I have tried:
awk 'NR==FNR{c[$1$2]++;next};!c[$1$2] > 0' file2 file1 > file3
but it deletes more lines than expected.
1026997259 file1
1787919 file2
1023608359 file3
How can I modify this code, so that:
I have 2 tab delimited files
A 2 3
A 5 4
B 4 5
B 5 5
C 10 12
and
A 2 5
A 5 4
B 5 3
F 6 7
Based only in the 1st and 2nd columns, I want to grab the lines in file1 that are also in file2 so that the output is:
B 5 5
C 10 12
Why not to use grep command?
grep -vf file2 file1
Think about it - if you concatenate ab c and a cb they both become abc so what do you think your code is doing with $1$2? Use SUBSEP as intended ($1,$2) and change !c[$1$2] > 0 to !(($1,$2) in c). Also consider whether !c[$1$2] > 0 means !(c[$1$2] > 0) or (!c[$1$2]) > 0. I'd never write the former code so idk for sure, I'd always write it with parens as I intended it to be parsed. So do:
awk 'NR==FNR{c[$1,$2];next} !(($1,$2) in c)' file2 file1
Or just use $0 instead of $1,$2:
awk 'NR==FNR{c[$0];next} !($0 in c)' file2 file1
If the matching lines in the two files are identical, and the two files are sorted in the same order, then comm(1) can do the trick:
comm -23 file1 file2
It prints out lines that are only in the first file (unless -1 is given), lines that are only in the second file (unless -2), and lines that are in both files (unless -3). If you leave more than one option enabled then they will be printed in multiple (tab-separated) columns.

How to print and store specific named columns from csv file with new row numbers

start by saying, I'm very new to using bash and any sort of script writing in general.
I have a csv file that has basic column headers and values underneath which looks something like this as an example:
a b c d
3 3 34 4
2 5 4 94
4 5 8 3
9 8 5 7
Is there a way to extract only the numerical values from a specific column and add a number for each row. For example first numbered row of the first column (starting from 1 after the column header) is 1, then 2, then 3, etc, for example for column b the output would be:
1 3
2 5
3 5
4 8
I would like to be able to do this for various different named column headers.
Any help would be appreciated,
Chris
Like this? Using awk:
$ awk 'NR>1{print NR-1, $2}' file
1 3
2 5
3 5
4 8
Explained:
$ awk ' # using awk for the job
NR>1 { # for the records or rows after the first
print NR-1, $2 # output record number minus one and the second field or column
}' file # state the file
I would like to be able to do this for various different named column headers. With awk you don't specify the column header name but the column number, like you don't state b but $2.
awk 'NR>1 {print i=1+i, $2}' file
NR>1 skips the first line, in your case the header.
print print following
i=1+i prints i, i is first 0 and add 1, so i is 1, next time 2 and so on.
$2 prints the second column.
file is the path to your file.
If you have a simple multi-space delimited file (as in your example) awk is the best tool for the job. To select the column by name in awk you can do something like:
$ awk -v col="b" 'FNR==1 { for (i=1;i<=NF;i++) if ($i==col) x=i; next }
{print FNR-1 OFS $x}' file
1 3
2 5
3 5
4 8

Extract column after pattern from file

I have a sample file which looks like this:
5 6 7 8
55 66 77 88
A B C D
1 2 3 4
2 4 6 8
3 8 12 16
E F G H
11 22 33 44
and so on...
I would like to enter a command in a bash script or just in a bash terminal to extract one of the columns independently of the others. For instance, I would like to do something like a grep/awk command with the pattern=C and get the following output:
C
3
6
12
How can I extract a specific column independent of the others and also put a # of lines to extract after the pattern so that I don't get the above column with the 7's or the G column in my output?
If it's always 3 records after the found term:
awk '{for(i=1;i<=NF;i++) {if($i=="C") col=i}} col>0 && rcount<=3 {print $col; rcount++}' test
This will look at each field in your record and if it finds a "C", it will capture the column number i. If the column number is greater than 0 then it will print the contents of the column. It counts up to 3 records and then stops printing.
$ cat tst.awk
!prevNF { delete f; for (i=1; i<=NF; i++) f[$i] = i }
NF && (tgt in f) { print $(f[tgt]) }
{ prevNF = NF }
$ awk -v tgt=C -f tst.awk file
C
3
6
12
$ awk -v tgt=F -f tst.awk file
F
22

Resources