bash - get all lines with the same column value in two files

I have two text files each with 3 fields. I need to get the lines with the same value on the third field. The 3rd field value is unique in each file. Example:
file1:
1 John 300
2 Eli 200
3 Chris 100
4 Ann 600
file2:
6 Kevin 250
7 Nancy 300
8 John 100
output:
1 John 300
7 Nancy 300
3 Chris 100
8 John 100
When I use the following command:
cat file1 file2 | sort -k 3 | uniq -c -f 2
I get only one row per duplicate value. I need both!
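Incidentally, the attempted pipeline is close: GNU uniq has a -D option that prints every line of each duplicate group instead of one per group. A minimal sketch of that route, recreating the question's sample data in temporary files:

```shell
cd "$(mktemp -d)"
# Recreate the question's sample files.
printf '%s\n' '1 John 300' '2 Eli 200' '3 Chris 100' '4 Ann 600' > file1
printf '%s\n' '6 Kevin 250' '7 Nancy 300' '8 John 100' > file2

# GNU uniq: -D prints all members of each duplicate group,
# -f 2 skips the first two fields when comparing.
cat file1 file2 | sort -k3,3 | uniq -D -f 2
# → 3 Chris 100
#   8 John 100
#   1 John 300
#   7 Nancy 300
```

Note the output is grouped by the third field rather than in the question's original order.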

This one-liner gives you that output:
awk 'NR==FNR{a[$3]=$0;next}$3 in a{print a[$3];print}' file1 file2
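A quick sanity check of the one-liner against the sample data from the question (file names assumed):

```shell
cd "$(mktemp -d)"
printf '%s\n' '1 John 300' '2 Eli 200' '3 Chris 100' '4 Ann 600' > file1
printf '%s\n' '6 Kevin 250' '7 Nancy 300' '8 John 100' > file2

# First pass (NR==FNR): index each file1 line by its third field.
# Second pass: for each file2 line whose $3 was seen, print both lines.
awk 'NR==FNR{a[$3]=$0;next} $3 in a{print a[$3]; print}' file1 file2
# → 1 John 300
#   7 Nancy 300
#   3 Chris 100
#   8 John 100
```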

My solution is
join -1 3 -2 3 <(sort -k3 file1) <(sort -k3 file2) | awk '{print $2, $3, $1; print $4, $5, $1}'
or
join -1 3 -2 3 <(sort -k3 file1) <(sort -k3 file2) -o "1.1 1.2 0 2.1 2.2 0" | xargs -n3
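For context on why the trailing awk (or xargs) step is needed: join emits the join field first, so each matched pair comes out as one line led by the shared key. A sketch of the raw join output on the same sample files (names assumed):

```shell
cd "$(mktemp -d)"
printf '%s\n' '1 John 300' '2 Eli 200' '3 Chris 100' '4 Ann 600' > file1
printf '%s\n' '6 Kevin 250' '7 Nancy 300' '8 John 100' > file2

# Raw join output: "<key> <rest of file1 line> <rest of file2 line>"
join -1 3 -2 3 <(sort -k3 file1) <(sort -k3 file2)
# → 100 3 Chris 8 John
#   300 1 John 7 Nancy
```

The awk step then splits each pair back into two records and moves the key to the end.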


awk print a specific column from multiple files to a new file

I have few files. e.g.
file1.txt
1 2 3 4
2 3 3 4
3 2 4 2
7 2 0 0
1 2 9 9
3 0 9 0
file2.txt
3
4
2
33
NAN
NAN
file3.txt
2
4
4
NAN
NAN
NAN
I would like to print the 1st column from each file into a new file, replacing NAN with "?".
The desired output file:
ofile.txt
1 3 2
2 4 4
3 2 4
7 33 ?
1 ? ?
3 ? ?
I was trying with awk '$1 {print}' file1.txt file2.txt file3.txt > ofile.txt
but it does not print my desired output.
You may use this single awk to get this output:
awk '{s[FNR] = (s[FNR] == "" ? "" : s[FNR] "\t") ($1 == "NAN" ? "?" : $1)}
END{for (i=1; i<=length(s); i++) print s[i]}' file[123].txt
1 3 2
2 4 4
3 2 4
7 33 ?
1 ? ?
3 ? ?
To store output in a file use > ofile.txt at the end of above command.
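One caveat: length() applied to an array is a GNU awk extension. A more portable sketch of the same idea tracks the maximum line number explicitly (sample file names assumed):

```shell
cd "$(mktemp -d)"
# Recreate the question's sample files.
printf '%s\n' '1 2 3 4' '2 3 3 4' '3 2 4 2' '7 2 0 0' '1 2 9 9' '3 0 9 0' > file1.txt
printf '%s\n' 3 4 2 33 NAN NAN > file2.txt
printf '%s\n' 2 4 4 NAN NAN NAN > file3.txt

# Same accumulation as above, but keep a max-line counter n
# instead of calling length() on the array (a gawk-only feature).
awk '{ s[FNR] = (s[FNR] == "" ? "" : s[FNR] "\t") ($1 == "NAN" ? "?" : $1)
       if (FNR > n) n = FNR }
     END { for (i = 1; i <= n; i++) print s[i] }' file[123].txt
```

The output columns are tab-separated, as in the original answer.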
EDIT: Since the OP's input files contain Control-M (carriage return) characters, here is an adjusted solution that also removes them from the input files.
paste <(awk '{gsub(/\r/,"");print $1}' file1) file2 file3 | awk '{gsub(/\r/,"");gsub(/NAN/,"?")} 1'
Could you please try the following.
paste <(awk '{print $1}' file1) file2 file3 | awk '{gsub(/NAN/,"?")} 1'
Using multiple awk invocations:
paste <(awk '{print $1}' file1 ) <(awk '{print $1}' file2 ) <(awk '{print $1}' file3) | sed 's/NAN/\?/g'
or
paste <(awk '{print $1}' file1 ) <(awk '{print $1}' file2 ) <(awk '{print $1}' file3) |awk '{gsub("NAN", "?", $0); print}'

How to merge two tab-separated files and predefine formatting of missing values?

I am trying to merge two unsorted tab separated files by a column of partially overlapping identifiers (gene#) with the option of predefining missing values and keeping the order of the first table.
When using paste on my two example tables missing values end up as empty space.
cat file1
c3 100 300 gene4
c1 300 400 gene1
c13 600 700 gene2
cat file2
gene1 4.2 0.001
gene4 1.05 0.5
paste file1 file2
c3 100 300 gene4 gene1 4.2 0.001
c1 300 400 gene1 gene4 1.05 0.5
c13 600 700 gene2
As you can see, the result unsurprisingly shows empty fields on unmatched lines (and paste does not match on the gene identifier at all). Is there a way to keep the order of file1 and fill lines like the third as follows:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 NA 1 1
I assume one way could be to build an awk conditional construct. It would be great if you could point me in the right direction.
With awk please try the following:
awk 'FNR==NR {a[$1]=$1; b[$1]=$2; c[$1]=$3; next}
{if (!a[$4]) {a[$4]="N/A"; b[$4]=1; c[$4]=1}
printf "%s %s %s %s\n", $0, a[$4], b[$4], c[$4]}
' file2 file1
which yields:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 N/A 1 1
[Explanations]
In the 1st line, FNR==NR { command; next} is an idiom to execute the command only when reading the 1st file in the argument list ("file2" in this case). Then it creates maps (aka associative arrays) to associate values in "file2" to genes
as:
gene1 => gene1 (with array a)
gene1 => 4.2 (with array b)
gene1 => 0.001 (with array c)
gene4 => gene4 (with array a)
gene4 => 1.05 (with array b)
gene4 => 0.5 (with array c)
It is not necessary that "file2" is sorted.
The following lines are executed only when reading the 2nd file ("file1") because these lines are skipped when reading the 1st file due to the next statement.
The line {if (!a[$4]) .. is a fallback to assign variables to default values when the associative array a[gene] is undefined (meaning the gene is not found in "file2").
The final line prints the contents of "file1" followed by the associated values via the gene.
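The idiom can also be seen in isolation, stripped to its smallest form (hypothetical file names and contents, for illustration only):

```shell
cd "$(mktemp -d)"
printf '%s\n' apple banana cherry > keys.txt           # hypothetical key list
printf '%s\n' 'banana 2' 'cherry 3' 'durian 4' > data.txt

# FNR==NR is true only while reading the first file argument:
# record its keys, then filter the second file against them.
awk 'FNR==NR {seen[$1]; next} $1 in seen' keys.txt data.txt
# → banana 2
#   cherry 3
```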
You can use join:
join -e NA -o '1.1 1.2 1.3 1.4 1.5 2.1 2.2 2.3' -a 1 -1 5 -2 1 <(nl -w1 -s ' ' file1 | sort -k 5) <(sort -k 1 file2) | sed 's/NA\sNA$/1 1/' | sort -n | cut -d ' ' -f 2-
-e NA — replace all missing values with NA
-o ... — output format (field is specified using <file>.<field>)
-a 1 — Keep every line from the left file
-1 5, -2 1 — Fields used to join the files
file1, file2 — The files
nl -w1 -s ' ' file1 — file1 with numbered lines
<(sort -k X fileN) — File N ready to be joined on column X
s/NA\sNA$/1 1/ — Replace NA NA at the end of a line with 1 1
| sort -n | cut -d ' ' -f 2- — sort numerically and remove the first column
The example above uses spaces on output. To use tabs, append | tr ' ' '\t':
join -e NA -o '1.1 1.2 1.3 1.4 2.1 2.2 2.3' -a 1 -1 4 -2 1 file1 file2 | sed 's/NA\sNA$/1 1/' | tr ' ' '\t'
The broken lines have a TAB as the last character. Fix this with
paste file1 file2 | sed 's/\t$/\tNA\t1\t1/g'
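A quick check of that fix on the sample files (GNU sed assumed for \t in the pattern): paste still emits its tab delimiter when file2 runs out of lines, so the short rows end in a tab that sed can target with \t$.

```shell
cd "$(mktemp -d)"
printf '%s\n' 'c3 100 300 gene4' 'c1 300 400 gene1' 'c13 600 700 gene2' > file1
printf '%s\n' 'gene1 4.2 0.001' 'gene4 1.05 0.5' > file2

# Rows without a file2 partner end in a trailing tab; pad them out.
paste file1 file2 | sed 's/\t$/\tNA\t1\t1/'
```

The third line comes out as "c13 600 700 gene2<TAB>NA<TAB>1<TAB>1"; note this approach pairs lines positionally and does not match on the gene identifier.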

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk's length counts one extra character for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, even though it is not supposed to.
Is there any solution? Or something I'm doing wrong?
I'm gonna venture a guess: isn't your awk expecting Unix-style newlines (LF), while your dico.txt has Windows-style ones (CR+LF)? That would easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but dico.txt with windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8
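If converting the file is not an option, a minimal fix sketch strips the trailing CR inside awk before counting (a CRLF sample is built here as a stand-in for dico_win.txt):

```shell
cd "$(mktemp -d)"
# Build a small CRLF-terminated sample file.
printf 'ABAISSA\r\nABAISSABLE\r\nABAISSABLES\r\nABAISSAI\r\n' > dico_win.txt

# Remove a trailing carriage return, if any, before taking the length.
awk '{sub(/\r$/, ""); print length}' dico_win.txt | sort -nr | uniq -c
```

This restores the expected counts (lengths 11, 10, 8, 7, one word each).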

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing in the second column, the count is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join on the second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 Same as -a1 but applied to file2.
-oauto is in this case the same as -o 0,1.1,2.1, which will print the join column and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
Which is piped to awk, to subtract column three from column 2, and then reformatting.
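Putting the two steps together on the sample data (GNU join assumed for -o auto; both files are already sorted on column 2):

```shell
cd "$(mktemp -d)"
printf '%s\n' '3 1009' '7 1012' '2 1013' '8 1014' > file1
printf '%s\n' '5 1009' '3 1010' '1 1013' > file2

# Full outer join on column 2, missing counts filled with 0,
# then subtract file2's count from file1's and restore the layout.
join -j2 -a1 -a2 -o auto -e 0 file1 file2 | awk '{print $2 - $3, $1}'
# → -2 1009
#   -3 1010
#   7 1012
#   1 1013
#   8 1014
```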
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 = current $2 -> subtract
3 1010 2 # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1 # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
print p[1] - $1, p[2]
prev = ""
next
}
p[2] != "" {
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
split(prev,p)
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now we have a stream of output in that format, we can do the subtraction on each line. I used this sed command to transform the above output into a dc program:
sed -e 's/.*/c& -n[ ]np/'
This takes the three values on each line and rearranges them into a dc command for the subtraction, then executes it. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
Assuming the files are blank-separated; if they are actually comma-separated, add the argument -F ','.
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
{Discounts[$2]=$1; ids[$2]++}
END { for (id in ids) print Inits[id] - Discounts[id] " " id}
' file1.csv file2.csv
For memory-constrained cases (this could be done in a single pipeline, but a temporary file is used here for clarity):
awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
| sort -k2 \
> file.tmp
awk 'Last != $2 {
if (NR != 1) print Result " " Last
Last = $2; Result = $1
next
}
Last == $2 { Result += $1; next}
END { print Result " " Last}
' file.tmp
rm file.tmp

Unix Compare Two CSV files using comm

I have two CSV files. 1.csv has 718 entries and 2.csv has 68000 entries.
#cat 1.csv
#Num #Name
1 BoB
2 Jack
3 John
4 Hawk
5 Scot
...........
#cat 2.csv
#Num #Name
1 BoB
2 John
3 Linda
4 Hawk
5 Scot
........
I know how to compare two files when only one column (Names) is available in both, and how to get the matching names.
#comm -12 <(sort 1.csv) <(sort 2.csv)
Now I would like to check: if a Num in 1.csv matches a Num in 2.csv, what are the associated Names from both CSV files for that matched Num?
Result :
1,BoB,BoB
2,Jack,John
3,John,Linda
4,Hawk,Hawk
5,Scot,Scot
..........
How do I achieve this using comm?
You can use the join command to perform an inner join of the 2 csv files on the 1st field, i.e. the number. Here is an example:
$ cat f1.csv
1 BoB
2 Jack
3 John
4 Hawk
5 Scot
6 ExtraInF1
$ cat f2.csv
1 BoB
3 Linda
4 Hawk
2 John
5 Scot
7 ExtraInF2
$ join <(sort -t ' ' -k 1 f1.csv) <(sort -t ' ' -k 1 f2.csv)
1 BoB BoB
2 Jack John
3 John Linda
4 Hawk Hawk
5 Scot Scot
$ join <(sort -t ' ' -k 1 f1.csv) <(sort -t ' ' -k 1 f2.csv) | tr -s ' ' ,
1,BoB,BoB
2,Jack,John
3,John,Linda
4,Hawk,Hawk
5,Scot,Scot
$
Note I have added a few dummy rows (numbers 6 and 7) and also note that they don't appear in the output, as they aren't present in both files.
<(sort -t ' ' -k 1 f1.csv) is process substitution, i.e. it substitutes the output of the process at this place: sort with space as the delimiter (-t ' ') on the 1st key, i.e. the 1st column (-k 1). join by default performs an inner join on the 1st column of both files.
Another one-liner for inner join using the join command
join -1 1 -2 1 <(sort 1.csv) <(sort 2.csv) | tr -s ' ' ,
-1 1 : join on the 1st field of file 1
-2 1 : join on the 1st field of file 2
tr -s squeezes multiple spaces into a single space and replaces it with a comma (,)
