I am trying to join two files that have identical column 1 and different column 2:
File1
aaa 1
bbb 3
bbb 3
ccc 1
ccc 1
ccc 0
File2
aaa 2
bbb 2
bbb 2
ccc 1
ccc 1
ccc 0
When I try to join them with
join File1 File2 > File3
I get
aaa 1 2
bbb 3 2
bbb 3 2
bbb 3 2
bbb 3 2
ccc 1 1
ccc 1 1
ccc 1 0
ccc 1 1
ccc 1 1
ccc 1 0
ccc 0 1
ccc 0 1
ccc 0 0
join is trying to expand the duplicates, when all I want it to do is go line-by-line, so the output should be
aaa 1 2
bbb 3 2
bbb 3 2
ccc 1 1
ccc 1 1
ccc 0 0
How do I tell join to ignore duplicates and just combine the files line-by-line?
EDIT: This is being done in a loop with multiple files that all have the same column 1 but different column 2. I am joining the first two files into a temporary file and then looping through the other files joining with that temporary file.
Based on a suggestion from @Andre Wildberg, this worked best:
paste File1 <(cut -d " " -f 2 File2)
This allowed me to loop through a list of files:
cat File1 > tmp
for file in $files
do
paste tmp <(cut -d " " -f 2 "$file") > tmpf
mv tmpf tmp
done
mv tmp FinalFile
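Note that paste separates the pasted columns with a tab by default; if the output should stay space-delimited like the input, pass -d ' '. A minimal sketch (assuming bash for the <(...) process substitution, and recreating the sample files from the question):

```shell
# Recreate the two sample files from the question (assumption: space-delimited)
printf 'aaa 1\nbbb 3\nbbb 3\nccc 1\nccc 1\nccc 0\n' > File1
printf 'aaa 2\nbbb 2\nbbb 2\nccc 1\nccc 1\nccc 0\n' > File2

# paste glues the files line-by-line; -d ' ' uses a space instead of the default tab
paste -d ' ' File1 <(cut -d ' ' -f 2 File2)
```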
Assumptions:
all files have the same number of rows
all files have the same values in the first column for the same numbered row
the final result set can fit into memory
Sample input:
$ for f in f{1..4}
do
echo "############ $f"
cat $f
done
############ f1
aaa 1
bbb 3
bbb 3
ccc 1
ccc 1
ccc 0
############ f2
aaa 2
bbb 2
bbb 2
ccc 1
ccc 1
ccc 0
############ f3
aaa 12
bbb 12
bbb 12
ccc 11
ccc 11
ccc 10
############ f4
aaa 202
bbb 202
bbb 202
ccc 201
ccc 201
ccc 200
One awk idea:
awk '
FNR==NR { a[FNR]=$0; next }
{ a[FNR]=a[FNR] OFS $2 }
END { for (i=1;i<=FNR;i++)
print a[i]
}
' f1 f2 f3 f4
This generates:
aaa 1 2 12 202
bbb 3 2 12 202
bbb 3 2 12 202
ccc 1 1 11 201
ccc 1 1 11 201
ccc 0 0 10 200
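For comparison, the same result can be had without a temp file or a loop by handing paste one process substitution per extra file (bash syntax). This is only a sketch under the same assumptions (equal row counts, matching first columns):

```shell
# Recreate f1..f4 from the sample input above
printf 'aaa 1\nbbb 3\nbbb 3\nccc 1\nccc 1\nccc 0\n' > f1
printf 'aaa 2\nbbb 2\nbbb 2\nccc 1\nccc 1\nccc 0\n' > f2
printf 'aaa 12\nbbb 12\nbbb 12\nccc 11\nccc 11\nccc 10\n' > f3
printf 'aaa 202\nbbb 202\nbbb 202\nccc 201\nccc 201\nccc 200\n' > f4

# one paste call: f1 whole, plus the value column cut from each later file
paste -d ' ' f1 <(cut -d ' ' -f2 f2) <(cut -d ' ' -f2 f3) <(cut -d ' ' -f2 f4)
```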
Related
Good night. I have these two files:
File 1 - with phenotype information; the first column holds the IDs. The original file has 400 rows:
ID a b c d
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2 - with SNP information; the original file has 400 lines and 42,000 characters per line.
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
217 2 0 2 1
218 0 2 0 2
And I need to remove from file 2 the individuals that do not appear in file 1, for example:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
I used this code:
awk 'NR==FNR{a[$1]; next}$1 in a{print $0}' file1 file2 > file3
and I can get this output(file 3):
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
but I lose the header. How do I keep it?
To keep the header of the second file, add a condition{action} like this:
awk 'NR==FNR {a[$1]; next}
FNR==1 {print $0; next} # <= this will print the header of file2.
$1 in a {print $0}' file1 file2
NR holds the total record number, while FNR is the per-file record number: it counts the records of the file currently being processed. The next statements are also important, so that awk continues with the next record and doesn't try the rest of the actions.
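The NR/FNR distinction is easy to see with a throwaway example (the file names f1 and f2 here are just illustrative):

```shell
printf 'a\nb\n' > f1
printf 'c\nd\n' > f2
# NR keeps counting across files; FNR restarts at 1 for each new file
awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' f1 f2
```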
I am trying to use awk to edit files but I can't manage to do it without creating intermediate files.
Basically, I want to search using column 1 in file2, file3, and so on, and keep the lines whose 1st column matches, with their replaced 2nd column. (Note that file2 and file3 may contain other stuff.)
I have
File1.txt
aaa 111
aaa 222
bbb 333
bbb 444
File2.txt
zzz zzz
aaa 999
zzz zzz
aaa 888
File3.txt
bbb 000
bbb 001
yyy yyy
yyy yyy
Desired output
aaa 999
aaa 888
bbb 000
bbb 001
This does what you specified, though there are probably many edge cases not covered:
$ awk 'NR==FNR{a[$1]; next} $1 in a' file{1..3}
aaa 999
aaa 888
bbb 000
bbb 001
I am trying to find the decrements in a column and, if one is found, print the last highest value.
For example:
From 111 to 445 there is a continuous increase in the column, but 333 is less than the number before it.
111 aaa
112 aaa
112 aaa
113 sdf
115 aaa
222 ddd
333 sss
333 sss
444 sss
445 sss
333 aaa <<<<<< this is less than the number above it (445)
If any such scenario is found, then print 445 sss.
Like this, for example:
$ awk '{if (before>$1) {print before_line}} {before=$1; before_line=$0}' a
445 sss
What is it doing? It compares the previous value (stored in before) with the current one; if the previous is bigger, it prints the stored line.
It also works when there are multiple decrements:
$ cat a
111 aaa
112 aaa
112 aaa
113 sdf
115 aaa <--- this
15 aaa
222 ddd
333 sss
333 sss
444 sss
445 sss <--- this
333 aaa
$ awk '{if (before>$1) {print before_line}} {before=$1; before_line=$0}' a
115 aaa
445 sss
Store each number in a single variable called prevNumber; then, when you come to the next one, do a check, e.g. if (newNumber < prevNumber) print prevNumber;
I don't really know what language you are using.
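In awk the suggestion above would look something like this (prevNumber and prevLine are just the names borrowed from the pseudocode, and nums is a stand-in for the input file):

```shell
printf '111 aaa\n115 aaa\n445 sss\n333 aaa\n' > nums
# print the stored line whenever the current first field drops below the previous one
awk 'NR > 1 && $1 < prevNumber { print prevLine }
     { prevNumber = $1; prevLine = $0 }' nums
```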
You can say:
awk '$1 > max {max=$1; maxline=$0}; END{ print maxline}' inputfile
For your input, it'd print:
445 sss
I have a file like this
1 CC AAA
1 Na AAA
1 Na AAA
1 Na AAA
1 Na AAA
1 CC BBB
1 Na BBB
1 Na BBB
1 xa BBB
1 CC CCC
1 Na CCC
1 da CCC
I would like to replace column 2 with "01" for AAA, "02" for BBB, and so on for the entire file. Finally the output should look like:
1 01 AAA
1 01 AAA
1 01 AAA
1 01 AAA
1 01 AAA
1 02 BBB
1 02 BBB
1 02 BBB
1 02 BBB
1 03 CCC
1 03 CCC
1 03 CCC
I don't have any clue how to make this work. Please help me if possible. Every new group starts with CC; that is, the change from AAA to BBB can be tracked by the CC in the 2nd column.
One way of doing it in awk:
awk '$3!=a&&NF{a=$3;x=sprintf("%02d",++x);print $1,x,$3;next}$3==a&&NF{print $1,x,$3;next }1' inputFile
Here's one way using awk:
awk '$3 != r { ++i } { $2 = sprintf ("%02d", i) } { r = $3 }1' OFS="\t" file
I've set the OFS to a tab-char, but you can choose what you like. Results:
1 01 AAA
1 01 AAA
1 01 AAA
1 01 AAA
1 01 AAA
1 02 BBB
1 02 BBB
1 02 BBB
1 02 BBB
1 03 CCC
1 03 CCC
1 03 CCC
Seems like you want:
awk '$2=="CC" { a+=1 } {$2=sprintf("%02d",a)} 1' input
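All three answers lean on sprintf("%02d", n) for the zero-padded counter; a condensed check of the idea on a shortened sample:

```shell
printf '1 CC AAA\n1 Na AAA\n1 CC BBB\n1 xa BBB\n1 CC CCC\n' > grp
# bump a counter on each CC marker, then rewrite field 2 with the zero-padded count
awk '$2 == "CC" { n++ } { $2 = sprintf("%02d", n) } 1' grp
```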
Input file:
AAA 2 3 4 5
BBB 3 4 5
AAA 23 21 34
BBB 4 5 62
I want the output to be:
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
I feel that I should use awk or sed but I am not sure how to go about it. Does anyone have any good ideas? Thanks.
This might work for you:
sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]* \)\(.*\)\n\1/\1\2/;ta;P;D'
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
or with GNU awk:
awk '{if($1 in a){line=$0;sub(/[^ ]* /,"",line);a[$1]=a[$1]line;next};a[$1]=$0}END{n=asort(a);for(i=1;i<=n;i++)print a[i]}' file
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
Here is an awk one-liner to solve the above problem (guarding the first append so no leading space creeps in):
awk '{line=$2; for(i=3;i<=NF;i++) line=line " " $i; arr[$1]=(arr[$1]=="" ? line : arr[$1] " " line)} END{for (val in arr) print val, arr[val]}' file
Using bash version 4's associative arrays
$ declare -A vals
$ while read key nums; do vals[$key]+="$nums "; done < filename
$ for key in "${!vals[@]}"; do printf "%s %s\n" "$key" "${vals[$key]% }"; done
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62