Match and merge lines based on the first column - shell

I have 2 files:
File1
123:dataset1:dataset932
534940023023:dataset:dataset039302
49930:dataset9203:dataset2003
File2
49930:399402:3949304:293000232:30203993
123:49030:1204:9300:293920
534940023023:49993029:3949203:49293904:29399
and I would like to create
Desired result:
49930:399402:3949304:293000232:30203993:dataset9203:dataset2003
534940023023:49993029:3949203:49293904:29399:dataset:dataset039302
etc
where the result contains one line for each pair of input lines that have identical first column (with : as the column separator).

The join command is your friend here. You'll likely need to sort the inputs (either pre-sort the files, or use a process substitution if available - e.g. with bash).
Something like:
join -t ':' <(sort file2) <(sort file1) >file3

When you do not want to sort files, play with grep:
while IFS=: read key others; do
echo "${key}:${others}:$(grep "^${key}:" file1 | cut -d: -f2-)"
done < file2

Related

Merge unsorted lines from two files based on similar part

I am wondering if is it possible to merge information from two files together based on a similar part. file1 is ID with sequence after the blast, and file2 contains taxonomic names corresponding to two first numbers in name of sequences.
file 1:
>301-89_IDNAGNDJ_171582
>301-88_ALPEKDJF_119660
>301-88_ALPEKDJF_112039
...
file2:
301-89--sample1
301-88--sample2
...
output:
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
The files are unsorted and file1 contains more lines where is first two numbers similar to the first two numbers in one line in file2. I am looking for some tips/help on how to do that, it is possible to do that like this? which command or language should I use?
(mawk/nawk/gawk -e/-ce/-Pe) '
FNR == !_ {
_ = ! ( ___=match(FS=FNR==NR ? "[-][-]" : "[>_]", "[>-]"))
$_ = $_
} FNR == NR { __[$!_]="--"$NF; next } sub("$", __[$___])' file2.txt file1.txt
———————————————————————————
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
Using awk
$ awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' file2 OFS="--" file1
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
I am wondering if is it possible to merge information from two files together based on a similar part
Yes ...
The files are unsorted
... but only if they're sorted.
It's easier if we transform them so the delimiters are consistent, and then format it back together later:
sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 produces
301-88 ALPEKDJF_112039
301-88 ALPEKDJF_119660
301-89 IDNAGNDJ_171582
...
which we can just pipe through sort -k1
sed 's/--/ /' f2 produces
301-89 sample1
301-88 sample2
...
which we can sort the same way
join sorted1 sorted2 (with the sorted results of the previous steps) produces
301-88 ALPEKDJF_112039 sample2
301-88 ALPEKDJF_119660 sample2
301-89 IDNAGNDJ_171582 sample1
...
and finally we can format those 3 fields as you originally wanted, by piping through
sed 's/\(.*\) \(.*\) \(.*\)$/\1_\2--\3/'
If it's reasonable to sort them on the fly, we can just do that using process substitution:
$ join \
<( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' f1 | sort -k1 ) \
<( sed 's/--/ /' f2 | sort -k1 ) \
| sed 's/\(.*\) \(.*\) \(.*\)$/\1_\2--\3/'
301-88_ALPEKDJF_112039--sample2
301-88_ALPEKDJF_119660--sample2
301-89_IDNAGNDJ_171582--sample1
...
If it's not reasonable to sort the files - on the fly or otherwise - you're going to end up building a hash in memory, like the awk answer is doing. Give them both a try and see which is faster.

Filter records from one file based on a values present in another file using Unix

I have an Input csv file Input feed
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
There is an output error csv file which is generated from this input file which has the Primary Key
Error File
Pk,Error_Reason
D,Failure
E, Failure
F, Failure
I want to extract all the records from the input file and save it into a new file for which there is a Primary key entry in Error file.
Basically my new file should look like this:
New Input feed
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
I am a beginner in Unix and I have tried Awk command.
The Approach I have tried is, get all the primary key values into a file.
akw -F"," '{print $2}' error.csv >> error_pk.csv
Now I need to filter out the records from the input.csv for all the primary key values present in error.pk
Using awk. As there is leading space in the error file, it needs to be trimmend off first, I'm using sub for that. Then, since the titles of the first column are not identical, (PK vs Pk) that needs to be handled separately with FNR==1:
$ awk -F, ' # set separator
NR==FNR { # process the first file
sub(/^ */,"") # trim leading space
a[$1] # hash the first column
next
}
FNR==1 || ($1 in a)' error input # output tthe header record and if match hashed
Output:
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
You can use join.
First remove everything afte the comma from second file
Join on the first field from both files
cat <<EOF >file1
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat <<EOF >file2
PK,Error_Reason
D,Failure
E,Failure
F,Failure
EOF
join -t, -11 -21 <(sort -k1 file1) <(cut -d, -f1 file2 | sort -k1)
If you need the file to be sorted according to file1, you can number the lines in first file, join the files, re-sort using the line numbers and then remove the numbers from the output:
join -t, -12 -21 <(nl -w1 -s, file1 | sort -t, -k2) <(cut -d, -f1 file2 | sort -k1) |
sort -t, -k2 | cut -d, -f1,3-
You can use grep -f with a file with search items. Cut off at the ,.
grep -Ef <(sed -r 's/([^,]*).*/^\1,/' file2) file1
When you want a header in the output,

shell - compare files and update matching string awk/sed/diff/grep/csv

I need to compare 2 csv files and make modifications to the second column. I wrote out the logic out of how I would want to achieve this however, it seems to confuse the thread a lot more than I wanted too so I'll just write out the example.
Any help would be appreciated. Thanks in advance.
file1
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName3
user4,distinguishedName4
user5,distinguishedName5
file2
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
desired outcome:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
The solution using join command combined with awk command:
join -t',' -j1 -a1 -a2 file1 file2 | awk -F',' '{if(NF==3) $0=$1FS$3}1'
The output:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
Explanation:
-- for join command:
-t',' - defines field separator
-j1 - tells to join on first field 1
-a FILENUM - print unpairable lines coming from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2
-- for awk command:
NF - contains a total number of fields
FS - field separator(i.e. ,)
if(NF==3) $0=$1FS$3 - the condition, checks if there's a complement third field(as result of joining the files on lines with common first field) to perform the replacement
https://linux.die.net/man/1/join
awk to the rescue!
awk -F, '!a[$1]++' file2 file1
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
user2,distinguishedName2
user4,distinguishedName4
this order is based on file2 and file1 record order, if you want sorted order just pipe to sort
awk ... | sort

How can I merge two file smartly with unique key?

I have two files, such like:
File1:
A,Content1
B,Content2
C,Content3
File2:
D,Content4
E,Content5
B,Content6
There is the same key in file1 and file2, could I merge two files smartly that the result file is just as:
A,Content1
B,Content2
C,Content3
D,Content4
F,Content5
You should be able to accomplish this with a single sort:
sort -t',' -k1,1 -u file1 file2
It sets the field separator to comma, sorts and dedupes on only the first field.
If your files aren't too big (which is what I'll assume from your usage of shellscript) :
#!/bin/bash
keys=$(cat "$#" | cut -d',' -f1 | sort -u)
for key in $keys
do
grep -h $key "$#" | head -1
done
Basically :
extract the keys (stuff that is before the first comma)
find the first occurrence of that key in the files (that's the head -1)

Joining two text files based on a common field (ip address)

File1
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
File2
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
Output should be
abcd-efg|random1||abcd|10.10.1.1||yes
bcde-ab|random2||bc|10.1.2.2||no
efgh-bd|ramdom3||fgh|10.2.1.1||yes
ijkl|random4||mno|10.3.2.3||no
I was trying to join both text files based on ip address using awk and joins but some how not able to get the right output.
Could you help me get through the right output.Thanks in advance
$ awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
abcd-efg|random1||abcd|10.10.1.1|| yes
bcde-ab|random2||bc|10.1.2.2|| no
efgh-bd|ramdom3||fgh|10.2.1.1|| yes
ijkl|random4||mno|10.3.2.3|| no
This approach will work even if the IPs are in the files in different orders.
How it works
-F'|'
Set the field separator on input to |.
FNR==NR{a[$1]=$2; next}
When reading the first file, file2, save the second field as a value in associative array a under the key of the first field. Skip remaining commands and jump to the next line.
print $0 a[$5]
If we get here, we are working on the second file, file1. Print the line followed by the value of a for this IP.
BSD/OSX
On BSD (OSX) awk, try:
awk -F'|' 'FNR==NR{a[$1]=$2; next;} {print $0 a[$5];}' file2 file1
Unix join command can be used for this
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 file1 file2
Explanation of options:
-t\| : Field separator is '|' (escaped)
-j1 5 -j2 1 : Join based on 5th field of file1 and 1st field of file2
-o1.1,1.2,1.3,1.4,1.5,1.6,2.2 : Output the 6 fields from file1 and 2nd field from file2
If the input files are not sorted, they need to be sorted first, like below
join -t\| -j1 5 -j2 1 -o1.1,1.2,1.3,1.4,1.5,1.6,2.2 <(sort -t'|' -k5 file1) <(sort -t'|' -k1 file2)
Assuming both files have IP address in same order as shown in OP's example
paste -d'\0' file1 <(cut -d' ' -f2 file2)
cut -d' ' -f2 file2 select second column of file2, column separation is space character specified by delimiter -d' '
Using process substitution, output of cut command is passed as file input to paste command
paste command then combines file1 and output of cut column wise without any character in between (reference: paste without delimiter)

Resources