extract columns from multiple .csv files and merge them into one - bash

I have three files from which I want to extract some columns and paste them in a new file. The files don't necessarily have the same number of lines. They are sorted on the values in their first column.
File 1 has the following structure:
col1;col2;col3;col4
SAMPLE-1;1;1;1
SAMPLE-2;1;1;1
SAMPLE-3;1;1;1
SAMPLE-4;1;1;1
This file is separated by ";" instead of ",".
File 2 has the following structure:
col5,col6,col7,col8
SAMPLE-1_OTHER_INFO,2,2,2
SAMPLE-2_OTHER_INFO,2,2,2
SAMPLE-3_OTHER_INFO,2,2,2
File 3 has the following structure:
col9,col10,col11,col12
SAMPLE-1_OTHER_INFO,3,3,3
SAMPLE-2_OTHER_INFO,3,3,3
SAMPLE-3_OTHER_INFO,3,3,3
The output file (summary.csv) should look like this:
col1,col2,col4,col6,col7,col10,col12
SAMPLE-1,1,1,2,2,3,3
SAMPLE-2,1,1,2,2,3,3
SAMPLE-3,1,1,2,2,3,3
SAMPLE-4,1,1,,,,
Basically the first columns of all three files contain the sample identifier. 'col1' of file1 should be the first column of the output file. The identifiers in col1 should then be matched with those in col5 and col9 of file2 and file3. The '_OTHER_INFO' part should not be taken into account when doing the comparison.
If there is a match, the col6, col7, col10 and col12 values of file2 and file3 should be added.
If there is no match, the line should still be in the output file, but the last four columns should be empty (as is the case with 'SAMPLE-4').
I was planning to perform this action with awk or the cut/paste commands. However, I don't know how to look for a match between the values in col1, col5 and col9.

Try the following and let me know if it helps you.
awk 'BEGIN{
  FS=";"
}
FNR==1{            # first line of each input file: bump the file counter
  f++
}
f==1 && FNR>1{     # file1 (";"-separated): keep col2 and col4, keyed on col1
  a[$1]=$2","$4
  order[++n]=$1    # remember file1 order so the output stays sorted
  next
}
f>1 && FNR==1{     # file2 and file3 are ","-separated
  FS=","
}
f==2 && FNR>1{     # file2: strip the "_OTHER_INFO" suffix, keep col6 and col7
  sub(/_.*/,"",$1)
  b[$1]=$2","$3
  next
}
f==3 && FNR>1{     # file3: strip the suffix, keep col10 and col12
  sub(/_.*/,"",$1)
  c[$1]=$2","$4
  next
}
END{
  print "col1,col2,col4,col6,col7,col10,col12"
  for(j=1;j<=n;j++){   # walk file1 in order; unmatched keys get empty fields
    i=order[j]
    printf("%s,%s,%s,%s\n",i,a[i],(i in b)?b[i]:",",(i in c)?c[i]:",")
  }
}
' file1 file2 file3
I'll try to add an explanation shortly.
EDIT1: adding a one-liner form of the solution too.
awk 'BEGIN{FS=";"} FNR==1{f++} f==1&&FNR>1{a[$1]=$2","$4;order[++n]=$1;next} f>1&&FNR==1{FS=","} f==2&&FNR>1{sub(/_.*/,"",$1);b[$1]=$2","$3;next} f==3&&FNR>1{sub(/_.*/,"",$1);c[$1]=$2","$4;next} END{print "col1,col2,col4,col6,col7,col10,col12";for(j=1;j<=n;j++){i=order[j];printf("%s,%s,%s,%s\n",i,a[i],(i in b)?b[i]:",",(i in c)?c[i]:",")}}' file1 file2 file3

join + sed trick (for sorted input files):
join -t, -j1 -a1 -o1.1,1.2,1.4,2.2,2.3 <(tr ';' ',' < file1) <(sed 's/_[^,]*//g' file2) |
  join -t, -a1 -o1.1,1.2,1.3,1.4,1.5,2.2,2.4 - <(sed 's/_[^,]*//g' file3)
Note that join expects both inputs sorted on the join field, so strip the header lines first (e.g. with tail -n +2).
The output:
SAMPLE-1,1,1,2,2,3,3
SAMPLE-2,1,1,2,2,3,3
SAMPLE-3,1,1,2,2,3,3
SAMPLE-4,1,1,,,,
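Building on that, a variant that also writes the summary.csv header (a sketch, assuming bash process substitution and GNU join; tail -n +2 strips the input headers):
{
  echo 'col1,col2,col4,col6,col7,col10,col12'
  join -t, -j1 -a1 -o1.1,1.2,1.4,2.2,2.3 \
      <(tail -n +2 file1 | tr ';' ',') \
      <(tail -n +2 file2 | sed 's/_[^,]*//g') |
    join -t, -a1 -o1.1,1.2,1.3,1.4,1.5,2.2,2.4 - \
      <(tail -n +2 file3 | sed 's/_[^,]*//g')
} > summary.csv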

Related

using awk how to merge 2 files, say A & B and do a left outer join function and include all columns in both files

I have multiple files with different numbers of columns. I need to merge the first and second files with a left outer join in awk, relative to the first file, printing all columns of both files and matching on the first column of each.
I have tried the code below to get close to my output, but I can't print the empty ',,' fields where no matching number is found in the second file. join needs sorting and takes more time than awk; my files are big, around 30 million records.
awk -F ',' '{
  if (NR==FNR) { r[$1]=$0 }
  else { if ($1 in r)
    r[$1]=r[$1] gensub($1,"",1) }
}
END{ for(i in r){ print r[i] } }' file1 file2
file1
number,column1,column2,..columnN
File2
numbr,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    if ( FNR == 1 ) {
        empty = gensub(/[^,]/,"","g",tail)
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(); with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
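For reference, a minimal sketch of that extra step in POSIX awk (same logic as tst.awk above, with sub()/gsub() on a copied variable instead of gensub()):
BEGIN { FS=OFS="," }
NR==FNR {
    tail = $0
    sub(/[^,]*,/,"",tail)        # drop the key field, keep the rest of the line
    if ( FNR == 1 ) {
        empty = tail
        gsub(/[^,]/,"",empty)    # keep only the commas: the right number of empty fields
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }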
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    idx[$1] = NR
    file2[NR] = tail
    if ( FNR == 1 ) {
        file2[""] = gensub(/[^,]/,"","g",tail)
    }
    next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
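If you do test it, a simple harness is enough to see a difference at 30 million records (a sketch; tst1.awk and tst2.awk are placeholder names for the two scripts above):
time awk -f tst1.awk file2 file1 > /dev/null
time awk -f tst2.awk file2 file1 > /dev/null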
You can try:
awk 'BEGIN{FS=OFS=","}
FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
{print $0, ($1 in d?d[$1]:",")}' file2 file1
you get:
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': Field separator token.
-a 1: Do not discard records from file 1 if not present in file 2.
-e '': Missing records will be treated as an empty field.
-o: Output format.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
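Remember that join needs both inputs sorted on the join field. If yours are not, a sketch with bash process substitution sorts them inline:
join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 \
    <(sort -t, -k1,1 file1.txt) <(sort -t, -k1,1 file2.txt)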

compare two columns in awk and print values from lookup files into output file

I have two files: the first has ~16000 lines and the second is a lookup file with ~4000 lines.
Sample contents of file1 is given below:
id,title,name,value,details
01,23456, , ,abcdefg
02,23456, , ,abcdefg
03,12345, , ,abcdefg
04,34534, , ,abcdefg
...
Sample contents of lookup file file2 is given below:
sno,title,name,value
1,23456,abc,xyz
2,12345,cde,efg
3,34534,543,234
Now my requirement is to compare column 2 of file1 against the lookup file and insert the values of column 3 and column 4 from the lookup file into a new output file.
The output file should look like below:
id,title,name,value,details
01,23456,abc,xyz,abcdefg
02,23456,abc,xyz,abcdefg
03,12345,cde,efg,abcdefg
04,34534,543,234,abcdefg
I did try a few iterations based on existing questions but didn't get the results I desired. Any solution with awk would be much appreciated.
$ cat vino.awk
BEGIN { FS = OFS = "," }
NR==FNR { name[$2]=$3; value[$2]=$4; next }
{ print $1, $2, name[$2], value[$2], $5 }
$ cat file1
id,title,name,value,details
01,23456, , ,abcdefg
02,23456, , ,abcdefg
03,12345, , ,abcdefg
04,34534, , ,abcdefg
$ cat file2
sno,title,name,value
1,23456,abc,xyz
2,12345,cde,efg
3,34534,543,234
$ awk -f vino.awk file2 file1
id,title,name,value,details
01,23456,abc,xyz,abcdefg
02,23456,abc,xyz,abcdefg
03,12345,cde,efg,abcdefg
04,34534,543,234,abcdefg
Here's an awk one-liner:
awk -F, 'FNR==NR {n[$2]=$3;v[$2]=$4} FNR!=NR{OFS=","; print $1,$2,n[$2],v[$2],$5}' file2 file1
The idea is to process in two passes, first for file2 to store all of the names and values, then for file1, to print out each line including the collected names and values.
awk -F"," 'BEGIN{OFS=","} NR==FNR {a[$2]=$3","$4;next} {print $1,$2,a[$2],$5;}' file2 file1

deleting duplicate columns from csv file

I've got perfmon outputting to a CSV and I need to delete any repeated columns, e.g.:
COL1, Col2, Col3, COL1, Col4, Col5
When columns repeat, it's almost always the same column, but it doesn't happen every time. What I've got so far is a couple of manual steps:
When the column count is greater than it should be, I output each column header on its own line:
head -n1 output.csv | sed 's/,/\n/g'
Then, when I know which column numbers are guilty, I delete manually, e.g.:
cut -d"," --complement -f5,11 < output.csv > output2.csv
If somebody can point me in the right direction I'd be grateful!
Updated to give a rough example of output.csv contents; it should be familiar to anyone who's used perfmon:
"COLUMN1","Column2","Column3","COLUMN1","Column4"
"1","1","1","1","1"
"a","b","c","a","d"
"x","dd","ffd","x","ef"
I need to delete the repeated COLUMN1 (4th col)
Just to be clear, I'm trying to think of a way of automatically going into output.csv and deleting repeated columns without having to tell it which columns to delete a la my manual method above. Thanks!
Try this awk (not really a one-liner). It handles more than one duplicated column, and it checks only the title row (the first line) to decide which columns are duplicated; your example works that way too.
awk script (one-liner version):
awk -F, 'NR==1{for(i=1;i<=NF;i++)if(!($i in v)){ v[$i];t[i]}}{s=""; for(i=1;i<=NF;i++)if(i in t)s=s sprintf("%s,",$i);if(s){sub(/,$/,"",s);print s}} ' file
clear version (same script):
awk -F, 'NR==1{
    for(i=1;i<=NF;i++)
        if(!($i in v)){v[$i];t[i]}
}
{
    s=""
    for(i=1;i<=NF;i++)
        if(i in t)
            s=s sprintf("%s,",$i)
    if(s){
        sub(/,$/,"",s)
        print s
    }
}' file
With an example (note I created two duplicated cols):
kent$ cat file
COL1,COL2,COL3,COL1,COL4,COL2
1,2,3,1,4,2
a1,a2,a3,a1,a4,a2
b1,b2,b3,b1,b4,b2
d1,d2,d3,d1,d4,d2
kent$ awk -F, 'NR==1{
    for(i=1;i<=NF;i++)
        if(!($i in v)){v[$i];t[i]}
}
{
    s=""
    for(i=1;i<=NF;i++)
        if(i in t)
            s=s sprintf("%s,",$i)
    if(s){
        sub(/,$/,"",s)
        print s
    }
}' file
COL1,COL2,COL3,COL4
1,2,3,4
a1,a2,a3,a4
b1,b2,b3,b4
d1,d2,d3,d4
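If you would rather keep your head/cut workflow, here is a sketch (assuming GNU cut for --complement) that derives the duplicate column numbers from the header automatically instead of reading them off by hand:
# Collect the field numbers whose header has already been seen, as a comma-separated list.
dups=$(head -n1 output.csv |
    awk -F, '{for(i=1;i<=NF;i++) if($i in seen) printf "%s%d",(n++?",":""),i; else seen[$i]}')
# Cut those columns out; with no duplicates, just pass the file through.
if [ -n "$dups" ]; then
    cut -d, --complement -f"$dups" output.csv > output2.csv
else
    cp output.csv output2.csv
fi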

Merge two files using awk and write the output

I have two files with a common field. I want to merge the two files on that common field and write the merged output to another file using awk on the Linux command line.
file1
412234$name1$value1$mark1
413233$raja$$mark2
414444$$$
file2
412234$sum$file2$address$street
413233$sum2$file32$address2$street2$path
414444$$$$
These sample files are separated by $ and the merged output file should also be $-separated. Note that the rows can have empty fields.
I tried the script using join:
join -t "$" out2.csv out1.csv | sort -un > file3.csv
But the total number of lines didn't match.
Tried with awk:
myawk.awk
#!/usr/bin/awk -f
NR==FNR{a[FNR]=$0;next} {print a[FNR],$2,$3}
I ran it:
awk -f myawk.awk out2.csv out1.csv > file3.csv
It was also taking too much time and not responding.
Here out2.csv is the master file and we have to compare it with out1.csv.
Could you please help me write the merged output to another file?
Run the following using bash. This gives you the equivalent of a full outer join:
join -t'$' -a 1 -a 2 <(sort -k1,1 -t'$' out1.csv ) <(sort -k1,1 -t'$' out2.csv )
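If you also want missing fields padded, so every merged line carries the same number of $-separated columns, a sketch along the same lines (assuming GNU join, with out1.csv holding 4 fields and out2.csv up to 6) adds -e and an explicit -o list:
join -t'$' -a1 -a2 -e '' -o 0,1.2,1.3,1.4,2.2,2.3,2.4,2.5,2.6 \
    <(sort -t'$' -k1,1 out1.csv) <(sort -t'$' -k1,1 out2.csv)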
You were headed in the right direction with the awk solution. The main point is to change FS so that fields split on $:
Content of script.awk:
awk '
BEGIN {
## Split fields with "$".
FS = "$"
}
## Save lines from second file, the first field as the index of the
## array, and rest of the line as the value.
FNR == NR {
file2[ $1 ] = substr( $0, index( $0, "$" ) )
next
}
## Print when keys from both files match.
FNR < NR {
if ( $1 in file2 ) {
printf "%s$%s\n", $0, file2[ $1 ]
}
}
' out2.csv out1.csv
Output:
412234$name1$value1$mark1$$sum$file2$address$street
413233$raja$$mark2$$sum2$file32$address2$street2$path
414444$$$$$$$$
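Note that this prints only the lines of out1.csv whose key also appears in out2.csv (an inner join). A sketch of a left-outer variant under the same assumptions keeps the unmatched lines too:
awk '
BEGIN { FS = "$" }
FNR == NR {
    file2[ $1 ] = substr( $0, index( $0, "$" ) )
    next
}
{
    ## Print the merged line on a match, the bare out1.csv line otherwise.
    if ( $1 in file2 )
        printf "%s$%s\n", $0, file2[ $1 ]
    else
        print
}
' out2.csv out1.csv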

BASH - Join on non-first column

I am trying to join 2 files together; both files are in CSV format and have the same columns. Here is an example of each file:
File 1:
CustName,AccountReference,InvoiceDate,InvoiceRefID,TelNo,Rental,GPRS,Mnet,MnetPlus,SMS,CSD,IntRoaming,NetAmount
acme,107309 ,2011-09-24 12:47:11.000,AP/157371,07741992165 ,2.3900,.0000,.0000,.0000,.0000,.0000,.0000,2.3900
acme,107309 ,2011-09-24 12:58:32.000,AP/162874,07740992165 ,2.0000,.0000,.0000,.0000,.0000,.0000,.0000,2.0000
anot,107308 ,2011-09-24 12:58:32.000,AP/162874,07824912428 ,2.0000,.0000,.0000,.0000,.0000,.0000,.0000,2.0000
anot,107308 ,2011-09-24 12:47:11.000,AP/157371,07834919928 ,1.5500,.0000,.0000,.0000,.0000,.0000,.0000,1.5500
File 2:
CustName,AccountReference,InvoiceDate,InvoiceRefID,TelNo,Rental,GPRS,Mnet,MnetPlus,SMS,CSD,IntRoaming,NetAmount
acme,100046,2011-10-05 08:29:19,AB/020152,07824352342,12.77,0.00,0.00,0.00,0.00,0.00,0.00,12.77
anbe,100046,2011-10-05 08:29:19,AB/020152,07741992165,2.50,0.00,0.00,0.00,0.00,0.00,0.00,2.50
acve,100046,2011-10-05 08:29:19,AB/020152,07740992165,10.00,0.00,0.00,0.00,0.00,0.00,0.00,10.00
asce,100046,2011-10-05 08:29:19,AB/020152,07771335702,2.50,0.00,0.00,0.00,0.00,0.00,0.00,2.50
I would like to join the 2 files together, taking just some of the columns; the other columns can be ignored (some are the same, some are different):
AccountRef,telno,rental_file1,rental_file2,gprs_file1,gprs_file2 etc. ...
The join should be done on the telno column (it seems I have whitespace in file 1; I hope that can be ignored?).
I have found lots of examples using join, but all of them use the first column as the key. Any pointers would be great. Thanks!
The basic answer is:
join -t , -1 5 -2 5 file1 file2
This will join the files file1 and file2 on column 5 (the TelNo column) of each. Note that join only joins on a single field per file, given with -1 and -2; repeating those options does not add further join keys. The data files must be sorted on the join column, of course. The -t , sets the separator for CSV - but join will not handle embedded commas inside quoted strings.
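Concretely, since TelNo is field 5 in both files and file 1 has stray spaces before some commas, a sketch (assuming bash and GNU join/sort) might look like:
join -t, -1 5 -2 5 \
    <(sed 's/ *,/,/g' file1 | sort -t, -k5,5) \
    <(sort -t, -k5,5 file2)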
If your data is simple (no quoted strings) then you can also use awk. If your data has quoted strings which may contain commas, etc, then you need a CSV-aware tool. I'd probably use Perl with the Text::CSV module (and the Text::CSV_XS module for performance).
awk -F' *, *' 'NR > 1 && NR == FNR {
    _[$5] = $0; next
}
NR == 1 {
    print "AccountReference", "TelNo", "Rental_" ARGV[2], \
        "Rental_" ARGV[3], "GPRS_" ARGV[2], "GPRS_" ARGV[3]
    next
}
$5 in _ {
    split(_[$5], t)
    print $2, $5, $6, t[6], $7, t[7]
}' OFS=, file2 file1
Have a look at cat and cut :-)
For instance
cat file1 file2 | cut -d, -f2,5
yields
107309 ,07741992165
107309 ,07740992165
107308 ,07824912428
107308 ,07834919928
100046,07824352342
100046,07741992165
100046,07740992165
100046,07771335702
All the GNU utilities are documented here:
http://www.gnu.org/s/coreutils/manual/html_node/index.html#Top
For your problem, see cat, cut, sort, uniq and join.
