Compare all but the last N columns across two files in bash

I have 2 files: one with 18 columns and another with many more. I need to find the rows that mismatch on only the first 18 columns, ignoring the remaining columns of the second file. However, I need to preserve and print the entire row, so cut alone will not work.
File 1:
F1 F2 F3....F18
A B C.... Y
AA BB CC... YY
File 2:
F1 F2 F3... F18... F32
AA BB CC... YY... 123
AAA BBB CCC... YYY...321
Output Not In File 1:
AAA BBB CCC YYY...321
Output Not In File 2:
A B C...Y
If possible, I would like to use diff or awk with as few loops as possible.

You can use awk:
awk '{k=""; for(i=1; i<=18; i++) k=k SUBSEP $i} FNR==NR{a[k]; next} !(k in a)' file1 file2
For each row in both files we first build a key by concatenating the first 18 fields (joined with SUBSEP).
While reading the first file (FNR==NR) we store that key in an associative array and skip to the next record.
Finally, we print each row of the 2nd file whose key is not found in the associative array.
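A quick sanity check (a sketch on made-up data, using 3 key columns instead of 18 for brevity; run it once in each file order to get both directions of the difference):
$ printf 'A B C\nAA BB CC\n' > file1
$ printf 'AA BB CC 123\nAAA BBB CCC 321\n' > file2
$ awk '{k=""; for(i=1;i<=3;i++) k=k SUBSEP $i} FNR==NR{a[k]; next} !(k in a)' file1 file2
AAA BBB CCC 321
$ awk '{k=""; for(i=1;i<=3;i++) k=k SUBSEP $i} FNR==NR{a[k]; next} !(k in a)' file2 file1
A B C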

You can use grep:
grep -vf file1 file2
grep -vf <(cut -d" " -f1-18 file2) file1
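Note that -f treats each line of the pattern file as a regular expression and matches it anywhere in the target line, so data containing metacharacters such as . can produce false matches. A fixed-string variant (a sketch; the substring-matching caveat still applies):
grep -vF -f <(cut -d" " -f1-18 file2) file1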

To get the set differences between the two files you'll need a little more, similar to @anubhava's answer:
$ awk 'NR==FNR{f1[$0]; next}
       {k=$1; for(i=2;i<=18;i++) k=k FS $i;
        if(k in f1) delete f1[k];
        else f2[$0]}
   END{print "not in f1";
       for(k in f2) print k;
       print "\nnot in f2";
       for(k in f1) print k}' file1 file2
This can be rewritten to preserve the order of file2:
$ awk 'NR==FNR{f1[$0]; next}
       {k=$1; for(i=2;i<=18;i++) k=k FS $i;
        if(k in f1) delete f1[k];
        else {if(!p) print "not in f1";
              f2[$0]; print; p=1}}
   END{print "\nnot in f2";
       for(k in f1) print k}' file1 file2
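A quick run on made-up 3-column data (a sketch; change the loop bound back to 18 for the real files):
$ printf 'A B C\nAA BB CC\n' > file1
$ printf 'AA BB CC 123\nAAA BBB CCC 321\n' > file2
$ awk 'NR==FNR{f1[$0]; next}
       {k=$1; for(i=2;i<=3;i++) k=k FS $i;
        if(k in f1) delete f1[k];
        else {if(!p) print "not in f1";
              f2[$0]; print; p=1}}
   END{print "\nnot in f2";
       for(k in f1) print k}' file1 file2
not in f1
AAA BBB CCC 321

not in f2
A B C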

Related

merge 2 files based on 1,2,3 column matching in 2 files & print specific columns from 2 files. also retain non-pairing rows from both files in output [duplicate]

file1
rs12345 G C
rs78901 A T
file2
3 22745180 rs12345 G C,G
12 67182999 rs78901 A G,T
desired output
3 22745180 rs12345 G C
12 67182999 rs78901 A T
I tried
awk 'NR==FNR {h[$1] = $3; next} {print $1,$2,$3,h[$2]}' file1 file2
The output generated is:
3 22745180 rs12345
I want to print the first 4 columns of file2, with the 3rd column of file1 as the 5th column of the output.
You may use this awk:
awk 'FNR == NR {map[$1,$2] = $3; next} ($3,$4) in map {$NF = map[$3,$4]} 1' f1 f2 | column -t
3 22745180 rs12345 G C
12 67182999 rs78901 A T
A more readable version:
awk '
FNR == NR {
    map[$1,$2] = $3
    next
}
($3,$4) in map {
    $NF = map[$3,$4]
}
1' file1 file2 | column -t
column -t is used here only to format the output as a table.
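The trailing 1 is the usual awk idiom for printing every record: a true pattern with no action defaults to {print}, which is what lets file2 rows without a match in file1 pass through unchanged. A minimal illustration:
$ echo 'a b' | awk '{$2="B"} 1'
a B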
In the case you presented (rows in the two files are already paired line by line) this will work:
paste file1 file2 | awk '{print $4,$5,$6,$7,$3}'
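For reference, paste joins each pair of lines with a TAB, so the combined record that awk sees looks like this (fields 4-7 come from file2 and field 3 is file1's allele):
$ paste file1 file2
rs12345 G C	3 22745180 rs12345 G C,G
rs78901 A T	12 67182999 rs78901 A G,T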

Splitting csv file into multiple files with 2 columns in each file

I am trying to split a file (testfile.csv) that contains the following:
1,2,4,5,6,7,8,9
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
into a file
1,2
a,b
q,w
a,s
z,x
and another file
4,5
c,d
e,r
d,f
c,v
but I cannot seem to do that in awk using an iterative solution.
awk -F, '{print $1, $2}'
awk -F, '{print $3, $4}'
does it for me but I would like a looping solution.
I tried
awk -F, '{ for (i=1;i< NF;i+=2) print $i, $(i+1) }' testfile.csv
but it gives me a single column. It appears that I am iterating over the first row and then moving on to the second row, skipping every other element of that specific row.
You can use cut:
$ cut -d, -f1,2 file > file_1
$ cut -d, -f3,4 file > file_2
If you are going to use awk be sure to set the OFS so that the columns remain a CSV file:
$ awk 'BEGIN{FS=OFS=","}
{print $1,$2 >"f1"; print $3,$4 > "f2"}' file
$ cat f1
1,2
a,b
q,w
a,s
z,x
$ cat f2
4,5
c,d
e,r
d,f
c,v
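To see why OFS matters, compare the default output separator (a space) with an explicit comma:
$ echo '1,2,4,5' | awk -F, '{print $1,$2}'
1 2
$ echo '1,2,4,5' | awk 'BEGIN{FS=OFS=","} {print $1,$2}'
1,2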
Is there a quick and dirty way of naming the resulting files after the values in the first row (so the first file would be 1.csv and the second would be 4.csv)?
awk 'BEGIN{FS=OFS=","}
FNR==1 {n1=$1 ".csv"; n2=$3 ".csv"}
{print $1,$2 >n1; print $3,$4 > n2}' file
awk -F, '{ for (i=1; i < NF; i+=2) print $i, $(i+1) > (i ".csv") }' testfile.csv
works for me (parenthesizing the file-name expression keeps the concatenation portable across awk implementations). I was trying to view the output in the terminal, where it was all jumbled up.
It's do-able in bash, but it will be much slower than awk:
f=testfile.csv
IFS=, read -ra first < <(head -1 "$f")
for ((i = 0; i < (${#first[@]} + 1) / 2; i++)); do
slice_file="${f%.csv}$((i+1)).csv"
cut -d, -f"$((2 * i + 1))-$((2 * (i + 1)))" "$f" > "$slice_file"
done
with sed (h saves the line in the hold space, g restores it, and the w flag writes the result of each substitution to the named file; the single-character . patterns assume one-character fields, as in the sample data):
sed -r '
h
s/(.,.),.*/\1/w file1.txt
g
s/.,.,(.,.),.*/\1/w file2.txt' file.txt

awk: two files are queried

I have two files
file1:
>string1<TAB>Name1
>string2<TAB>Name2
>string3<TAB>Name3
file2:
>string1<TAB>sequence1
>string2<TAB>sequence2
I want to use awk to compare column 1 of respective files. If both files share a column 1 value I want to print column 2 of file1 followed by column 2 of file2. For example, for the above files my expected output is:
Name1<TAB>sequence1
Name2<TAB>sequence2
this is my code:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$2], $2 }' file1 file2 >out
But the only thing I get is an empty first column followed by the sequence.
Where is the error here?
Your assignment is not right: you store $1 instead of $2, and then you look the array up with $2 instead of $1.
$ awk 'BEGIN {FS=OFS="\t"}
NR==FNR {a[$1]=$2; next}
$1 in a {print a[$1],$2}' file1 file2
Name1 sequence1
Name2 sequence2

Using a command-line utility to perform the following map-updates

I'm a complete newbie to command-line utilities and am wondering how to process information like the following:
mapping.txt:
80 001 002
81 011 012 013 014
82 021 022
...
input.txt:
81 103823044
80 103823054
81 103823064
...
Desired output.txt:
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|001|
103823054|002|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
I've done simple mapping where the column numbers are fixed, but I'm unsure how to map a dynamic number of columns to the desired output.
If order is not important, join and awk can do the job easily.
$ join <(sort input.txt) <(sort mapping.txt) | awk -v OFS="|" '{for (i=3;i<=NF;i++) print $2, $i OFS}'
103823054|001|
103823054|002|
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
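If the original order of input.txt does matter, one sketch (assuming the field layout above) is to tag each line with its line number before sorting, then sort back on that number after the join:
$ join -1 2 -2 1 <(nl -w1 -s' ' input.txt | sort -k2,2) <(sort mapping.txt) |
    sort -k2,2n |
    awk -v OFS="|" '{for (i=4;i<=NF;i++) print $3, $i OFS}'
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|001|
103823054|002|
103823064|011|
103823064|012|
103823064|013|
103823064|014|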
Here's a GNU awk script that uses multi-dimensional arrays to do what you want:
#!/usr/bin/awk -f
BEGIN { OFS="|" }
FNR==NR { for(i=2;i<=NF;i++) a[$1][$i]; next }
$1 in a { for(k in a[$1]) print $2, k, "" }
If you save that to a file like script.awk and then chmod +x script.awk you can run it like:
$ ./script.awk mapping.txt input.txt
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|002|
103823054|001|
103823064|011|
103823064|012|
103823064|013|
103823064|014|
Here's a breakdown of the script:
BEGIN - set the output field separator to |
FNR==NR - process the first file (mapping.txt) and store the data in a multi-dimensional array keyed by $1 first, then by each of the remaining fields. next skips any further processing of the line.
$1 in a - test whether the line has a mapping. If so, print each corresponding mapping (multi-dimensional arrays are also GNU awk specific; note that for(k in ...) visits keys in an unspecified order, which is why 002 appears before 001 above). The commas in the print command are converted to the OFS value.
It can be remade into a "one-liner" like:
awk -v OFS="|" 'FNR==NR {for(i=2;i<=NF;i++) a[$1][$i]; next} $1 in a {for(k in a[$1]) print $2, k, ""}' mapping.txt input.txt
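If you do want the mappings printed in sorted key order with the multi-dimensional array, GNU awk lets you control for(k in ...) traversal through PROCINFO["sorted_in"] (a sketch; @ind_str_asc sorts indices as strings):
awk -v OFS="|" 'BEGIN {PROCINFO["sorted_in"]="@ind_str_asc"}
                FNR==NR {for(i=2;i<=NF;i++) a[$1][$i]; next}
                $1 in a {for(k in a[$1]) print $2, k, ""}' mapping.txt input.txt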
Here's a version of the script that uses a single dimensional array to store $0 then split()s it later to preserve order:
#!/usr/bin/awk -f
BEGIN { OFS="|" }
FNR==NR { a[$1]=$0; next }
$1 in a { c=split(a[$1], b); for(i=2;i<=c;i++) print $2, b[i], "" }
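Saved as, say, script2.awk (a hypothetical name) and run the same way, this version reproduces the desired output in mapping-file order:
$ ./script2.awk mapping.txt input.txt
103823044|011|
103823044|012|
103823044|013|
103823044|014|
103823054|001|
103823054|002|
103823064|011|
103823064|012|
103823064|013|
103823064|014|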

find duplicate data between files in unix

I have 2 files with these contents:
file1:
918802944821 919968005200 kushinagar
919711354546 919211999924 delhi
915555555555 916666666666 kanpur
919711354546 915686524578 hehe
918802944821 4752168549 hfhkjh
file2:
919211999924 919711354546 ghaziabad
919999999999 918888888888 lucknow
912222222222 911111111111 chandauli
918802944821 916325478965 hfhjdhjd
Now notice that number1 and number2 are interchanged between file1 and file2. I want to print only such duplicate lines on the screen. To be more specific, I want printed only the numbers (or the whole line) that are duplicated across the two files; for example, 919711354546 and 919211999924 appear interchanged in the two files. I want either just those two numbers or the whole line on the screen.
Using awk you can do:
awk 'FNR==NR{a[$1,$2]++;next} a[$2,$1]' f1 f2
919211999924 919711354546 ghaziabad
EDIT: Based on your edited question you can do:
awk 'FNR==NR{a[$1]++;b[$2]++;next} a[$1] || b[$1] {print $1} a[$2] || b[$2]{print $2}' f1 f2
919211999924
919711354546
918802944821
kent$ awk 'NR==FNR{a[$2 FS $1]=1;next}a[$1 FS $2]{print $1,$2}' f1 f2
919211999924 919711354546
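A small extension (a sketch): to report lines whose number pair appears in the other file either interchanged or in the same order, test both key orders:
awk 'NR==FNR{a[$1 FS $2]; next} ($2 FS $1) in a || ($1 FS $2) in a' f1 f2
919211999924 919711354546 ghaziabad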
