Merge unsorted lines from two files based on similar part - bash

I am wondering if it is possible to merge information from two files together based on a similar part. file1 contains sequence IDs obtained after a BLAST search, and file2 contains the taxonomic names corresponding to the first two numbers in each sequence name.
file 1:
>301-89_IDNAGNDJ_171582
>301-88_ALPEKDJF_119660
>301-88_ALPEKDJF_112039
...
file2:
301-89--sample1
301-88--sample2
...
output:
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
The files are unsorted, and file1 contains several lines whose first two numbers match the first two numbers of a single line in file2. I am looking for tips on how to do this. Is it possible, and which command or language should I use?

(mawk/nawk/gawk -e/-ce/-Pe) '
FNR == !_ {
_ = ! ( ___=match(FS=FNR==NR ? "[-][-]" : "[>_]", "[>-]"))
$_ = $_
} FNR == NR { __[$!_]="--"$NF; next } sub("$", __[$___])' file2.txt file1.txt
———————————————————————————
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2

Using awk
$ awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' file2 OFS="--" file1
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
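Note that splitting on both `_` and `-` makes the lookup key in the one-liner above just the second number (89, 88), which is enough for the sample data. A minimal sketch that keys on the whole `301-89` prefix instead might look like the following. The function name `merge_by_prefix` is hypothetical, and the sketch assumes every line of file2 contains `--` and every ID in file1 has an `_` after the prefix.

```shell
# Hypothetical helper: append the sample name from the mapping file to each
# sequence ID, keyed on the full prefix (e.g. "301-89").
# $1 = mapping file (prefix--name), $2 = ID file (>prefix_rest)
merge_by_prefix() {
    awk '
        NR == FNR {                     # first file: build the lookup table
            i = index($0, "--")
            if (i) name[substr($0, 1, i - 1)] = substr($0, i + 2)
            next
        }
        {                               # second file: look up each ID
            key = $0
            sub(/^>/, "", key)          # drop the leading ">"
            sub(/_.*/, "", key)         # keep the text before the first "_"
            if (key in name) print $0 "--" name[key]
            else print $0               # no match: leave the line unchanged
        }
    ' "$1" "$2"
}
```

It would be called as `merge_by_prefix file2 file1`, and it keeps the lines in the order they appear in file1.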

I am wondering if it is possible to merge information from two files together based on a similar part
Yes ...
The files are unsorted
... but only if they're sorted.
It's easier if we transform them so the delimiters are consistent, and then format it back together later:
sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 produces
301-89 IDNAGNDJ_171582
301-88 ALPEKDJF_119660
301-88 ALPEKDJF_112039
...
which we can just pipe through sort -k1,1 (sorting on the first field only, which is the field join compares)
sed 's/--/ /' file2 produces
301-89 sample1
301-88 sample2
...
which we can sort the same way
join sorted1 sorted2 (with the sorted results of the previous steps) produces
301-88 ALPEKDJF_112039 sample2
301-88 ALPEKDJF_119660 sample2
301-89 IDNAGNDJ_171582 sample1
...
and finally we can format those 3 fields as you originally wanted (restoring the leading >), by piping through
sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
If it's reasonable to sort them on the fly, we can just do that using process substitution:
$ join \
<( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 | sort -k1,1 ) \
<( sed 's/--/ /' file2 | sort -k1,1 ) \
| sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
>301-89_IDNAGNDJ_171582--sample1
...
If it's not reasonable to sort the files - on the fly or otherwise - you're going to end up building a hash in memory, like the awk answer is doing. Give them both a try and see which is faster.

Related

Compare csv files based on column value

I have two large csv files:
File1.csv
id,name,code
1,dummy,0
2,micheal,3
5,abc,4
File2.csv
id,name,code
2,micheal,4
5,abc,4
1,cd,0
I want to compare two files based on id and if any of the columns are mismatched, I want to output those rows.
For example, for id 1 the name is different and for id 2 the code is different, so the output should be:
output
1,cd,0
2,micheal,4
Both files will have the same ids, though possibly in a different order.
I want to write a script that gives me the above output.
If you need the rows of File2 that have no exact match in File1, you can use Miller with this simple command:
mlr --csv join --np --ur -j id,name,code -f File1.csv File2.csv >./out.csv
The output will contain:
+----+---------+------+
| id | name    | code |
+----+---------+------+
| 2  | micheal | 4    |
| 1  | cd      | 0    |
+----+---------+------+
awk -F, 'NR==FNR && FNR!=1 { map[$0]=1;next } FNR!=1 { if ( !map[$0] ) { print } }' File1.csv File2.csv
This sets the field separator to comma. For the first file (NR==FNR), it builds an array, map, keyed by the whole line, skipping the header. Then, for the second file, it prints every line that has no entry in map.
The tool of choice for finding differences between files is, of course, diff. Here, it doesn't really matter if these files are comma-separated or in some other format because you're really only interested in lines that differ.
Knowing that both files contain the same IDs makes this quite easy, although the fact that they will not necessarily be in the same order means both must be sorted first.
In your example, you want as output the lines from File2 so running the diff output through a grep for ^> will give you that.
Finally, let's get rid of the two additional characters at the beginning of the output lines that will have been inserted by diff, using cut:
diff <(sort File1.csv) <(sort File2.csv) | grep '^>' | cut -c3-
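The answers above compare whole lines. If the comparison should be driven by the id column explicitly, a minimal awk sketch could look like this. The function name `diff_by_id` is hypothetical, and the sketch assumes the id is the first comma-separated field, each file has exactly one header row, and no field contains a quoted comma.

```shell
# Hypothetical helper: print the rows of the second CSV whose id is missing
# from the first CSV or whose row content differs for the same id.
# $1 = reference CSV, $2 = CSV to report on
diff_by_id() {
    awk -F, '
        FNR == 1 { next }                  # skip the header row of each file
        NR == FNR { row[$1] = $0; next }   # first file: remember each row by id
        !($1 in row) || row[$1] != $0      # second file: print new or changed rows
    ' "$1" "$2"
}
```

Called as `diff_by_id File1.csv File2.csv`, it needs no sorting and prints the rows in the order they appear in the second file.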

Extracting unique values between 2 files with awk

I need to get the unique lines when comparing 2 files. These files contain the field separator ":", which should be treated as the end of the line when comparing strings.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (2 occurrences of orange and 2 occurrences of banana deleted)
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So only the strings before : should be compared.
Is it possible to complete this task in 1 command?
I tried it this way, but the field separator has no effect here:
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
One comment: awk keeps the whole lookup array in memory, which can become costly. In real life, with big files, it may be better to use something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2
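A grep-based variant with anchored patterns can be sketched as follows. It only reports lines of the first file, so it matches the expected output of the question rather than a symmetric difference. The function name `filter_out_keys` is hypothetical, and the sketch assumes the keys contain no regular-expression metacharacters and that grep accepts `-` as the pattern file (GNU grep does).

```shell
# Hypothetical helper: print the lines of $1 whose key (the text before the
# first ":") does not occur as a key in $2.
filter_out_keys() {
    cut -d: -f1 "$2" |      # extract the keys to exclude
        sort -u |           # one pattern per distinct key
        sed 's/.*/^&:/' |   # anchor each key: "^key:"
        grep -v -f - "$1"   # keep lines of $1 that match none of the patterns
}
```

Called as `filter_out_keys file1 file2`, it keeps the original line order of file1.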

Intersection of N files

I've got a little problem in bash.
I've got N files containing filenames, and I would like to find the list of filenames which are contained in all of the files (the intersection of the files).
When there are 2 files, I found this solution: sort file1 file2 | uniq -d, and this does what I want.
But how can I generalize it to N files in the same folder?
File1
1
2
3
4
File2
1
4
File3
2
3
4
Output expected:
4
Thanks in advance,
Best Regards.
I'm not Marc B, but still, here's the implementation of his idea:
intersect() {
    # "$@" expands to all file arguments; $# is how many there are
    sort "$@" | uniq -cd | grep "^[^0-9]*$# "
}
# usage example
intersect file1 file2 file3
[EDIT:] To overcome the problem of duplicate lines within the same file, I'd do something like this:
intersect() {
    for file in "$@"; do
        sort -u "$file"
    done | sort | uniq -cd | grep "^[^0-9]*$# "
}
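The functions above print each matching line together with the count that uniq -c prepends. If only the lines themselves are wanted, the same counting idea can be sketched in awk. The function name `intersect_awk` is hypothetical; the sketch counts each distinct line at most once per file and prints the lines whose count equals the number of files.

```shell
# Hypothetical helper: print the lines that occur in every file given as an
# argument, without the count prefix.
intersect_awk() {
    awk -v n="$#" '
        FNR == 1 { split("", seen) }       # new file: forget the lines seen in the previous one
        !($0 in seen) {                    # first time this line appears in this file
            seen[$0] = 1
            count[$0]++                    # number of files containing the line
        }
        END {
            for (line in count)
                if (count[line] == n) print line
        }
    ' "$@" | sort                          # awk iteration order is unspecified, so sort the result
}
```

Usage is the same as before: `intersect_awk file1 file2 file3`.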

Match and merge lines based on the first column

I have 2 files:
File1
123:dataset1:dataset932
534940023023:dataset:dataset039302
49930:dataset9203:dataset2003
File2
49930:399402:3949304:293000232:30203993
123:49030:1204:9300:293920
534940023023:49993029:3949203:49293904:29399
and I would like to create
Desired result:
49930:399402:3949304:293000232:30203993:dataset9203:dataset2003
534940023023:49993029:3949203:49293904:29399:dataset:dataset039302
etc
where the result contains one line for each pair of input lines that have identical first column (with : as the column separator).
The join command is your friend here. You'll likely need to sort the inputs (either pre-sort the files, or use a process substitution if available - e.g. with bash).
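Something like this, sorting on the first field only so that the order matches what join expects:
join -t ':' <(sort -t ':' -k1,1 file2) <(sort -t ':' -k1,1 file1) >file3
If you do not want to sort the files, you can use grep instead:
while IFS=: read -r key others; do
    echo "${key}:${others}:$(grep "^${key}:" file1 | cut -d: -f2-)"
done < file2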
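The grep loop rescans file1 once for every line of file2. A single-pass alternative that needs no sorting is to load one file into an awk array, sketched below. The function name `merge_on_first_column` is hypothetical, and the sketch assumes each key occurs at most once in the file that is loaded into memory.

```shell
# Hypothetical helper: for each line of $2, append the trailing fields of the
# line in $1 that has the same first ":"-separated column.
# $1 = file supplying the appended fields, $2 = file supplying the leading fields
merge_on_first_column() {
    awk -F: '
        NR == FNR {                  # first file: store everything after the key
            key = $1
            sub(/^[^:]*:/, "")       # strip the key and its separator from the line
            rest[key] = $0
            next
        }
        $1 in rest { print $0 ":" rest[$1] }   # second file: print only paired lines
    ' "$1" "$2"
}
```

Called as `merge_on_first_column file1 file2`, it prints one merged line per pair, in the order of file2.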

How to remove duplicates by column (inverse ordering)

I've been looking for this here, but did not find the exact case. Sorry if it is a duplicate, but I couldn't find it.
I have a huge file in Debian that contains 4 columns separated by "#", with the following format:
username#source#date#time
For example:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I want to print unique rows based on the first two columns and, if duplicates are found, print the last event based on date/time. With the list above, the result should be:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I have tested it using two commands:
cat file | sort -u -t# -k1,2
cat file | sort -r -u -t# -k1,2
But both of them print the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40 --> Wrong line, it is older than the duplicate one
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
Is there any way to do it?
Thanks!
This should work
tac file | awk -F# '!a[$1,$2]++' | tac
Output
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
First, you need to sort the input file to ensure the order of the lines, so that for duplicate username#source pairs the times are ordered. A reverse sort is best, so that the last event comes first. This can be done with a simple sort, like:
sort -r < yourfile
From your input, this produces:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A222222#Juniper#2014-08-07#14:31:40
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
These are reverse-ordered lines where, for each username#source combination, the latest event comes first.
Next, you need to filter the sorted lines to keep only the first event for each combination. This can be done with several tools, such as awk, uniq, or perl.
So, the solution is
sort -r <yourfile | uniq -w16
or
sort -r <yourfile | awk -F# '!seen[$1,$2]++'
or
sort -r yourfile | perl -F'#' -lanE 'say $_ unless $seen{"$F[0],$F[1]"}++'
All of the above print the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
Finally, you can re-sort the unique lines as needed.
awk -F\# '{ p = ($1 FS $2 in a ); a[$1 FS $2] = $0 }
!p { keys[++k] = $1 FS $2 }
END { for (k = 1; k in keys; ++k) print a[keys[k]] }' file
Output:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
If you know for a fact that the first column is always 7 chars long, and the second column is also 7 chars long, you can extract unique lines considering only the first 16 characters (7 + 1 + 7 + 1, counting the two # separators) with:
uniq -w 16 file
Since you want the latter duplicate, you can reverse the data using tac prior to uniq and then reverse the output again:
tac file | uniq -w 16 | tac
Update: As commented below, uniq only detects adjacent duplicates, so the lines need to be sorted first. At that point this starts to become contrived, and the awk-based suggestions are better. Something like this would still work though:
sort -s -t"#" -k1,2 file | tac | uniq -w 16 | tac
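The tac-based commands keep whichever duplicate comes last in the file, which is the latest event only if the file is already in chronological order for each key. A minimal sketch that compares the date and time fields directly, and keeps the keys in first-seen order, could look like this. The function name `latest_per_key` is hypothetical, and the sketch assumes the date is in YYYY-MM-DD form and the time in HH:MM:SS form, so that plain string comparison orders them correctly.

```shell
# Hypothetical helper: for each username#source pair, print the row with the
# latest date/time, in the order the pairs first appear in the file.
latest_per_key() {
    awk -F'#' '
        {
            key = $1 FS $2
            stamp = $3 " " $4                      # e.g. "2014-08-08 09:15:34"
            if (!(key in best)) order[++n] = key   # remember first-seen order
            if (!(key in best) || stamp > when[key]) {
                best[key] = $0                     # keep the newest row for this key
                when[key] = stamp
            }
        }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$1"
}
```

Called as `latest_per_key file`, it reads the file once and needs neither sort nor tac.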

Resources