BASH - Join on non-first column

I am trying to join 2 files together - both files are in CSV format and both have the same columns. Here is an example of each file:
File 1:
CustName,AccountReference,InvoiceDate,InvoiceRefID,TelNo,Rental,GPRS,Mnet,MnetPlus,SMS,CSD,IntRoaming,NetAmount
acme,107309 ,2011-09-24 12:47:11.000,AP/157371,07741992165 ,2.3900,.0000,.0000,.0000,.0000,.0000,.0000,2.3900
acme,107309 ,2011-09-24 12:58:32.000,AP/162874,07740992165 ,2.0000,.0000,.0000,.0000,.0000,.0000,.0000,2.0000
anot,107308 ,2011-09-24 12:58:32.000,AP/162874,07824912428 ,2.0000,.0000,.0000,.0000,.0000,.0000,.0000,2.0000
anot,107308 ,2011-09-24 12:47:11.000,AP/157371,07834919928 ,1.5500,.0000,.0000,.0000,.0000,.0000,.0000,1.5500
File 2:
CustName,AccountReference,InvoiceDate,InvoiceRefID,TelNo,Rental,GPRS,Mnet,MnetPlus,SMS,CSD,IntRoaming,NetAmount
acme,100046,2011-10-05 08:29:19,AB/020152,07824352342,12.77,0.00,0.00,0.00,0.00,0.00,0.00,12.77
anbe,100046,2011-10-05 08:29:19,AB/020152,07741992165,2.50,0.00,0.00,0.00,0.00,0.00,0.00,2.50
acve,100046,2011-10-05 08:29:19,AB/020152,07740992165,10.00,0.00,0.00,0.00,0.00,0.00,0.00,10.00
asce,100046,2011-10-05 08:29:19,AB/020152,07771335702,2.50,0.00,0.00,0.00,0.00,0.00,0.00,2.50
I would like to join the 2 files together, taking just some of the columns; the other columns can be ignored (some are the same, some are different):
AccountRef,telno,rental_file1,rental_file2,gprs_file1,gprs_file2 etc.
The join should be done on the telno column (it seems I have white space in file 1 - I hope that can be ignored?).
I have found lots of examples using join, but all of them use the first column as the key for the join .... any pointers would be great - thanks

The basic answer is:
join -t, -1 5 -2 5 file1 file2
This will join file1 and file2 on column 5 (the TelNo column). The data files must be sorted on that column, of course, and join accepts only a single join field per file (repeating -1 or -2 just overrides the earlier value). The -t, sets the separator for CSV, but join will not handle embedded commas inside quoted strings.
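If you go this route, the header rows, the sort requirement and the stray whitespace from file 1 can all be handled inline with process substitution - a sketch, assuming the simple data shown (no quoted fields):
join -t, -1 5 -2 5 -o 1.2,0,1.6,2.6,1.7,2.7 \
    <(tail -n +2 file1 | sed 's/ *, */,/g' | sort -t, -k5,5) \
    <(tail -n +2 file2 | sort -t, -k5,5)
Here tail -n +2 drops each header, sed strips the whitespace around commas, and -o picks AccountReference, the join field (0, i.e. TelNo), and the Rental and GPRS columns from each file.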
If your data is simple (no quoted strings) then you can also use awk. If your data has quoted strings which may contain commas, etc, then you need a CSV-aware tool. I'd probably use Perl with the Text::CSV module (and the Text::CSV_XS module for performance).

awk -F' *, *' '
NR > 1 && NR == FNR {      # first file read (file2): index each data row by TelNo ($5)
    _[$5] = $0; next
}
NR == 1 {                  # very first line overall: emit the new header
    print "AccountReference", "TelNo", "Rental_" ARGV[3], \
          "Rental_" ARGV[2], "GPRS_" ARGV[3], "GPRS_" ARGV[2]
    next
}
$5 in _ {                  # second file read (file1): print rows whose TelNo matched
    split(_[$5], t)
    print $2, $5, $6, t[6], $7, t[7]
}' OFS=, file2 file1
Note that ARGV[1] is the OFS=, assignment, so ARGV[2] is file2 and ARGV[3] is file1; since $6 and $7 come from file1, the file1 labels must come first in the header.
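With the sample data above, this should print something like:
AccountReference,TelNo,Rental_file1,Rental_file2,GPRS_file1,GPRS_file2
107309,07741992165,2.3900,2.50,.0000,0.00
107309,07740992165,2.0000,10.00,.0000,0.00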

Have a look at cat and cut :-)
For instance
cat file1 file2 | cut -d, -f2,5
yields
107309 ,07741992165
107309 ,07740992165
107308 ,07824912428
107308 ,07834919928
100046,07824352342
100046,07741992165
100046,07740992165
100046,07771335702

All the GNU utilities are documented here:
http://www.gnu.org/s/coreutils/manual/html_node/index.html#Top
For your problem, see cat, cut, sort, uniq and join.
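For example, one way to wire those together for the question above (a sketch, keyed on TelNo; the headers and stray whitespace still need stripping, as shown earlier):
join -t, -1 2 -2 1 \
    <(cut -d, -f2,5,6 file1 | sort -t, -k2,2) \
    <(cut -d, -f5,6 file2 | sort -t, -k1,1)
This prints TelNo, AccountReference, and then the Rental column from each file.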

Related

Using bash comm command on columns but returning the entire line

I have two files, each with two columns and sorted only by the second column, such as:
File 1:
176 AAATC
6 CCGTG
80 TTTCG
File 2:
20 AAATC
77 CTTTT
50 TTTTT
I would like to use comm command using options -13 and -23 to get two different files reporting the different lines between the two files with the corresponding count number, but only comparing the second columns (i.e. the strings). What I tried so far was something like:
comm -23 <(cut -d$'\t' -f2 file1.txt) <(cut -d$'\t' -f2 file2.txt)
But I could only have the strings in output, without the numbers:
CCGTG
TTTCG
While what I want would be:
6 CCGTG
80 TTTCG
Any suggestion?
Thanks!
You can use join instead of comm:
join -1 2 -2 2 File1 File2 -a 1 -o 1.1,1.2,2.2
It will output the matching lines, too, but you can remove them with
| grep -v '[ACTG] [ACTG]'
Explanation:
-1 2 use the second column in file 1 for joining;
-2 2 similarly, use the second column in file 2;
-a 1 show also non-matching lines from file 1 - these are the ones you want in the end;
-o specifies the output format, here we want columns 1 and 2 from file 1 and column 2 from file 2 (this is just arbitrary, you can use column 1 as well, but the second command would be different: | grep -v '[ACTG] [0-9]').
comm is not the right tool for this job, and while join will work, you would need to run join twice and then filter the results further with some other command (e.g., grep).
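For instance, join's -v option prints only the unpairable lines, which removes the need for the grep filtering - a sketch, assuming both files are sorted on the second column:
join -1 2 -2 2 -v 1 -o 1.1,1.2 file1 file2    # lines unique to file1 (like comm -23)
join -1 2 -2 2 -v 2 -o 2.1,2.2 file1 file2    # lines unique to file2 (like comm -13)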
One awk idea that requires a single pass through each input file:
awk 'BEGIN { FS=OFS="\t" }
FNR==NR  { f1[$2]=$1; next }      # save 1st file entries
$2 in f1 { delete f1[$2]; next }  # 2nd file: if $2 is in f1[] then delete the f1[] entry and skip this line, else ..
         { f2[$2]=$1 }            # save 2nd file entries
END {     # at this point:
          # f1[] contains rows where field #2 only exists in the 1st file
          # f2[] contains rows where field #2 only exists in the 2nd file
    PROCINFO["sorted_in"]="#ind_str_asc"
    for (i in f1) print f1[i], i > "file-23"
    for (i in f2) print f2[i], i > "file-13"
}
' file1 file2
NOTE: the PROCINFO["sorted_in"] line requires GNU awk; without it the order of writes to the output files is not guaranteed, and OP would then need more (awk) code to maintain the ordering, or another OS-level utility (e.g., sort) to sort the final files.
This generates:
$ cat file-23
6 CCGTG
80 TTTCG
$ cat file-13
77 CTTTT
50 TTTTT
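If GNU awk is not available, one portable alternative (a sketch) is to drop the PROCINFO line and sort the output files afterwards:
sort -k2,2 -o file-23 file-23
sort -k2,2 -o file-13 file-13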

extract columns from multiple .csv files and merge them into one

I have three files from which I want to extract some columns and paste them in a new file. The files don't necessarily have the same number of lines. They are sorted on the values in their first column.
File 1 has the following structure:
col1;col2;col3;col4
SAMPLE-1;1;1;1
SAMPLE-2;1;1;1
SAMPLE-3;1;1;1
SAMPLE-4;1;1;1
This file is separated by ";" instead of ",".
File 2 has the following structure:
col5,col6,col7,col8
SAMPLE-1_OTHER_INFO,2,2,2
SAMPLE-2_OTHER_INFO,2,2,2
SAMPLE-3_OTHER_INFO,2,2,2
File 3 has the following structure:
col9,col10,col11,col12
SAMPLE-1_OTHER_INFO,3,3,3
SAMPLE-2_OTHER_INFO,3,3,3
SAMPLE-3_OTHER_INFO,3,3,3
The output file (summary.csv) should look like this:
col1,col2,col4,col6,col7,col10,col12
SAMPLE-1,1,1,2,2,3,3
SAMPLE-2,1,1,2,2,3,3
SAMPLE-3,1,1,2,2,3,3
SAMPLE-4,1,1,,,,
Basically the first columns of all three files contain the sample identifier. 'col1' of file1 should be the first column of the output file. The identifiers in col1 should then be matched with those in col5 and col9 of file2 and file3. The '_OTHER_INFO' part should not be taken into account when doing the comparison.
If there is a match, the col6, col7, col10 and col12 values of files 2 and 3 should be added.
If there is no match, the line should still be in the output file, but the last four columns should be empty (like in this case 'SAMPLE-4')
I was planning to perform this action with awk or the 'cut/paste' commands. However, I don't know how I should look for a match between the values in col1, col5 and col9.
Try the following and let me know if this helps you.
awk 'BEGIN{
    FS=";"              # file1 is semicolon-separated
}
FNR==1{
    f++                 # bump the file counter at the start of each new file
}
f==1 && FNR>1{
    a[$1]=$2","$4       # file1: save col2,col4 keyed on col1
    next
}
f>1 && FNR==1{
    FS=","              # file2 and file3 are comma-separated
}
f==2 && FNR>1{
    sub(/_.*/,"",$1)    # strip the _OTHER_INFO suffix from the key
    b[$1]=$2","$3       # file2: save col6,col7
    next
}
f==3 && FNR>1{
    sub(/_.*/,"",$1)
    c[$1]=$2","$4       # file3: save col10,col12
    next
}
END{
    print "col1,col2,col4,col6,col7,col10,col12"
    for(i in a){
        printf("%s,%s,%s,%s\n",i,a[i],b[i]?b[i]:",",c[i]?c[i]:",")
    }
}
' file1 file2 file3
I will try to add an explanation shortly.
EDIT1: adding a one-liner form of solution too.
awk 'BEGIN{FS=";"}FNR==1{f++} f==1 && FNR>1{a[$1]=$2","$4;next} f>1 && FNR==1{FS=","} f==2&&FNR>1{sub(/_.*/,"",$1);b[$1]=$2","$3;next} f==3&&FNR>1{sub(/_.*/,"",$1);c[$1]=$2","$4;next} END{print "col1,col2,col4,col6,col7,col10,col12";for(i in a){printf("%s,%s,%s,%s\n",i,a[i],b[i]?b[i]:",",c[i]?c[i]:",")}}' file1 file2 file3
join + sed trick (for sorted input files):
join -t, -j1 -a1 -o1.1,1.2,1.4,2.2,2.3 <(tr ';' ',' < file1) <(sed 's/_[^,]*//g' file2) |
join -t, - -a1 -o1.1,1.2,1.3,1.4,1.5,2.2,2.4 <(sed 's/_[^,]*//g' file3)
The output:
SAMPLE-1,1,1,2,2,3,3
SAMPLE-2,1,1,2,2,3,3
SAMPLE-3,1,1,2,2,3,3
SAMPLE-4,1,1,,,,
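Note that the pipeline above drops the header row; a sketch of one way to produce the complete summary.csv, stripping the input headers first so join only sees sorted data:
{ echo 'col1,col2,col4,col6,col7,col10,col12'
  join -t, -a1 -o1.1,1.2,1.4,2.2,2.3 \
      <(tail -n +2 file1 | tr ';' ',') \
      <(tail -n +2 file2 | sed 's/_[^,]*//g') |
  join -t, -a1 -o1.1,1.2,1.3,1.4,1.5,2.2,2.4 \
      - <(tail -n +2 file3 | sed 's/_[^,]*//g')
} > summary.csv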

shell - compare files and update matching string awk/sed/diff/grep/csv

I need to compare 2 csv files and make modifications to the second column. I wrote out the logic of how I would want to achieve this; however, it seemed to confuse things more than I intended, so I'll just give an example.
Any help would be appreciated. Thanks in advance.
file1
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName3
user4,distinguishedName4
user5,distinguishedName5
file2
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
desired outcome:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
A solution using the join command combined with awk:
join -t',' -j1 -a1 -a2 file1 file2 | awk -F',' '{if(NF==3) $0=$1FS$3}1'
The output:
user1,distinguishedName1
user2,distinguishedName2
user3,distinguishedName13
user4,distinguishedName4
user5,distinguishedName12
user6,distinguishedName4
Explanation:
-- for the join command:
-t',' - defines the field separator
-j1 - join on field 1
-a FILENUM - print unpairable lines coming from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2
-- for the awk command:
NF - the total number of fields
FS - the field separator (i.e. ,)
if(NF==3) $0=$1FS$3 - the condition checks whether there is a complementary third field (the result of joining the files on lines with a common first field) and performs the replacement
https://linux.die.net/man/1/join
awk to the rescue!
awk -F, '!a[$1]++' file2 file1
user1,distinguishedName1
user3,distinguishedName13
user5,distinguishedName12
user6,distinguishedName4
user2,distinguishedName2
user4,distinguishedName4
This order is based on the record order of file2 and file1; if you want sorted output, just pipe to sort:
awk ... | sort
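For example (a sketch, sorting on the first field):
awk -F, '!a[$1]++' file2 file1 | sort -t, -k1,1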

Compare two csv files, use the first three columns as identifier, then print common lines

I have two csv files. File 1 has 9861 rows and 4 columns, while File 2 has 6037 rows and 5 columns. Here are the files.
Link of File 1
Link of File 2
The first three columns are years, months, days respectively.
I want to get the lines in File 2 with the same identifier in File 1 and print this to File 3.
I found this command in some posts here, but it only works using one column as the identifier:
awk -F, 'NR==FNR {a[$1]=$0;next}; $1 in a {print a[$1]; print}' file1 file2
Is there a way to do this using awk or any simpler command where I can use the first three columns as the identifier?
I'll appreciate any help.
Just use more columns to form the unique key you need:
$ awk -F, 'NR==FNR {a[$1, $2, $3] = $0; next}
$1 SUBSEP $2 SUBSEP $3 in a' file1 file2
SUBSEP is the subscript separator. It has the default value of "\034" and is used to separate the parts of the indices of a multi-dimensional array. Thus, the expression foo["A", "B"] really accesses foo["A\034B"].
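A quick demonstration (a sketch) that the two subscript forms are equivalent:
awk 'BEGIN { a["A","B"] = 1; print (("A" SUBSEP "B") in a) }'    # prints 1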
awk -F, '{k=$1 FS $2 FS $3} NR==FNR{a[k];next} k in a' file1 file2
Untested of course since you didn't provide any sample input/output.

comparing CSV files in ubuntu

I have two CSV files and I need to check for creations, updates and deletions. Take the following example files:
ORIGINAL FILE
sku1,A
sku2,B
sku3,C
sku4,D
sku5,E
sku6,F
sku7,G
sku8,H
sku9,I
sku10,J
UPDATED FILE
sku1,A
sku2,B-UPDATED
sku3,C
sku5,E
sku6,F
sku7,G-UPDATED
sku11, CREATED
sku8,H
sku9,I
sku4,D-UPDATED
I am using the linux comm command as follows:
comm -23 --nocheck-order updated_file.csv original_file > diff_file.csv
Which gives me all newly created and updated rows as follows
sku2,B-UPDATED
sku7,G-UPDATED
sku11, CREATED
sku4,D-UPDATED
Which is great, but if you look closely "sku10,J" has been deleted, and I'm not sure of the best command/way to check for it. The data I have provided is merely a demo; the text "sku" does not exist in the real data, but column one of the CSV files is a unique 5-character identifier. Any advice is appreciated.
I'd use join instead:
join -t, -a1 -a2 -eMISSING -o 0,1.2,2.2 <(sort file.orig) <(sort file.update)
sku1,A,A
sku10,J,MISSING
sku11,MISSING, CREATED
sku2,B,B-UPDATED
sku3,C,C
sku4,D,D-UPDATED
sku5,E,E
sku6,F,F
sku7,G,G-UPDATED
sku8,H,H
sku9,I,I
Then I'd pipe that into awk:
join ... | awk -F, -v OFS=, '
$3 == "MISSING" {print "deleted: " $1,$2; next}
$2 == "MISSING" {print "added: " $1,$3; next}
$2 != $3 {print "updated: " $0}
'
deleted: sku10,J
added: sku11, CREATED
updated: sku2,B,B-UPDATED
updated: sku4,D,D-UPDATED
updated: sku7,G,G-UPDATED
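To write each class to its own file instead, the same awk can redirect per class - a sketch, with made-up output file names:
join ... | awk -F, -v OFS=, '
$3 == "MISSING" {print $1,$2 > "deleted.csv"; next}
$2 == "MISSING" {print $1,$3 > "created.csv"; next}
$2 != $3        {print $1,$3 > "updated.csv"}
'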
This might be a really crude way of doing it, but if you are certain that the values in each file do not repeat, then:
cat file1.txt file2.txt | sort | uniq -u
If each file contains repeating strings, then you can sort|uniq them before concatenation.
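For example (a sketch):
cat <(sort -u file1.txt) <(sort -u file2.txt) | sort | uniq -u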
