Merge two files using awk and write the output - bash

I have two files with a common field. I want to merge the two files on that common field and write the merged result to another file using awk on the Linux command line.
file1
412234$name1$value1$mark1
413233$raja$$mark2
414444$$$
file2
412234$sum$file2$address$street
413233$sum2$file32$address2$street2$path
414444$$$$
These sample files are separated by $ and the merged output file should also be $-separated. Some rows also contain empty fields.
I tried the script using join:
join -t "$" out2.csv out1.csv |sort -un > file3.csv
But the total number of lines in the output did not match.
Tried with awk:
myawk.awk
#!/usr/bin/awk -f
NR==FNR{a[FNR]=$0;next} {print a[FNR],$2,$3}
I ran it
awk -f myawk.awk out2.csv out1.csv > file3.csv
It was taking too much time and appeared to hang.
Here out2.csv is the master file, and it has to be compared with out1.csv.
Could you please help me write the merged result into another file?

Run the following using bash. This gives you the equivalent of a full outer join:
join -t'$' -a 1 -a 2 <(sort -k1,1 -t'$' out1.csv ) <(sort -k1,1 -t'$' out2.csv )
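To get the merged result into another file, as asked, just redirect the output (the name file3.csv is taken from your own attempt):
join -t'$' -a 1 -a 2 <(sort -k1,1 -t'$' out1.csv ) <(sort -k1,1 -t'$' out2.csv ) > file3.csv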

You were heading in the right direction with the awk solution. The main point is to change FS so that fields are split on $:
Content of script.awk:
awk '
BEGIN {
    ## Split fields with "$".
    FS = "$"
}
## Save lines from the first input file (out2.csv), using the first field
## as the array index and the rest of the line as the value.
FNR == NR {
    file2[ $1 ] = substr( $0, index( $0, "$" ) )
    next
}
## Print when keys from both files match.
FNR < NR {
    if ( $1 in file2 ) {
        printf "%s$%s\n", $0, file2[ $1 ]
    }
}
' out2.csv out1.csv
Output:
412234$name1$value1$mark1$$sum$file2$address$street
413233$raja$$mark2$$sum2$file32$address2$street2$path
414444$$$$$$$$
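To write this merge into another file, and to keep the script-file form of your myawk.awk attempt, a sketch would be to save just the part between the single quotes above as script.awk and run it with the output redirected:
awk -f script.awk out2.csv out1.csv > file3.csv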

Related

Split a large gz file into smaller ones filtering and distributing content

I have a gzip file of size 81G which I unzip; the size of the uncompressed file is 254G. I want to implement a bash script which takes the gzip file and splits it on the basis of the first column. The first column has values ranging from 1 to 10. I want to split the file into 10 subfiles, whereby all rows where the value in the first column is 1 are put into one file, all rows where the value in the first column is 2 are put into a second file, and so on. While I do that I don't want to put column 3 and column 5 in the new subfiles. Also the file is tab separated. For example:
col_1 col_2. col_3. col_4. col_5. col_6
1. 7464 sam. NY. 0.738. 28.9
1. 81932. Dave. NW. 0.163. 91.9
2. 162. Peter. SD. 0.7293. 673.1
3. 7193. Ooni GH. 0.746. 6391
3. 6139. Jess. GHD. 0.8364. 81937
3. 7291. Yeldish HD. 0.173. 1973
The file above should result in three different gzipped files, with col_3 and col_5 removed from each of the new subfiles. What I did was:
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1
awk -F, '{print > $1".csv.gz"}' file.csv.gz
But this is not producing the desired result. Also I don't know how to remove col_3 and col_5 from the new subfiles.
Like I said, the gzip file is 81G and therefore I am looking for an efficient solution. Insights will be appreciated.
You have to decompress and recompress; to get rid of columns 3 and 5, you could use GNU cut like this:
gunzip -c infile.gz \
| cut --complement -f3,5 \
| awk '{ print | "gzip > " $1 ".csv.gz" }'
Or you could get rid of the columns in awk:
gunzip -c infile.gz \
| awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 ".csv.gz" }'
Something like
zcat input.csv.gz | cut -f1,2,4,6- | awk '{ print | ("gzip -c > " $1 ".csv.gz") }'
Uncompress the file, remove fields 3 and 5, save to the appropriate compressed file based on the first column.
Robustly and portably with any awk, if the file is always sorted by the first field as shown in your example:
gunzip -c infile.gz |
awk '
BEGIN { FS=OFS="\t" }
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
otherwise:
gunzip -c infile.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3- |
awk '
BEGIN { FS=OFS="\t" }
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
The first awk adds a number at the front to ensure the header line sorts before the rest during the sort phase, and adds the line number so that lines with the same original first field value retain their original input order. Then we sort by the first field, and then cut away the 2 fields added in the first step, then use awk to robustly and portably create the separate output files, ensuring that each output file starts with a copy of the header. We close each output file as we go so that the script will work for any number of output files using any awk and will work efficiently even for a large number of output files with GNU awk. It also ensures that each output file name is properly quoted to avoid globbing, word splitting, and filename expansion.
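As a rough illustration of that decorate/sort/undecorate idea, take a hypothetical tab-separated input of a header line plus two data rows whose first fields are 3 and 1 (TABs shown as wide gaps). The first awk turns
col_1   col_2   col_3
3       foo     bar
1       baz     qux
into
0   1   col_1   col_2   col_3
1   2   3       foo     bar
1   3   1       baz     qux
The sort then keeps the 0-flagged header line first, groups the remaining lines by their original first field (now field 3), and orders ties by original line number (field 2); cut -f3- strips the two helper fields again before the final awk writes the per-key gzip files.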

Consolidate two tables awk

I have two files (all tab delimited):
database.txt
MAR001;string1;H
MAR002;string2;G
MAR003;string3;H
data.txt
data1;MAR002
data2;MAR003
And I want to consolidate these two tables using the MAR### column. Expected output (tab-delimited):
data1;MAR002;string2;G
data2;MAR003;string3;H
I want to use awk; this is my attempt:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$2] = $1; next } $2 in a { print $0, a[$1] }' data.txt database.txt
but this fails...
I would just use the join command. It's very easy:
join -t \; -1 1 -2 2 database.txt data.txt
MAR002;string2;G;data1
MAR003;string3;H;data2
You can specify output column order using -o. For example:
join -t \; -1 1 -2 2 -o 2.1,2.2,1.2,1.3 database.txt data.txt
data1;MAR002;string2;G
data2;MAR003;string3;H
P.S. I did assume your files are "semicolon separated" and not "tab separated". Also, your files need to be sorted by the key column.
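If they are not, a sketch of the same join with sorting done on the fly via process substitution (sorting each file on its own join field) would be:
join -t \; -1 1 -2 2 -o 2.1,2.2,1.2,1.3 <(sort -t \; -k1,1 database.txt) <(sort -t \; -k2,2 data.txt)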
awk -F '\t' '
FNR==1 && NR == 1 { strt=1 }
FNR==1 && NR != 1 { strt=0 }
strt==1 { dat[$1] = $2 ";" $3 }
strt==0 { if ( dat[$2] != "" ) { print $1 ";" $2 ";" dat[$2] } }
' database.txt data.txt
Read database.txt first, loading its data into the array dat. Then, when we encounter data.txt, check for matching entries in the dat array and print the required data if there is one.
Output:
data1;MAR002;string2;G
data2;MAR003;string3;H
First of all, ; and \t are different characters. If your real input files are tab-delimited, here is the fix to your code:
Change your code to:
awk '....... $1 in a { print a[$1], $0 }' data.txt database.txt
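Putting that fix together with your original attempt, a full sketch (assuming the files really are tab-delimited, as stated) would be:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$2] = $1; next } $1 in a { print a[$1], $0 }' data.txt database.txt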

Using awk, how to merge 2 files, say A & B, do a left outer join, and include all columns from both files

I have multiple files with different numbers of columns. I need to merge the first file and the second file and do a left outer join in awk with respect to the first file, printing all columns from both files where the first columns match.
I have tried the code below to get close to my output, but I can't print the empty fields (the trailing commas) where no matching number is found in the second file. Below is the code. join needs sorted input and takes more time than awk. My files are big, around 30 million records.
awk -F ',' '{
if (NR==FNR){ r[$1]=$0}
else{ if($1 in r)
r[$1]=r[$1]gensub($1,"",1)}
}END{for(i in r){print r[i]}}' file1 file2
file1
number,column1,column2,..columnN
File2
numbr,column1,column2,..columnN
Output
number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN
file1
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
desired output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
if ( FNR == 1 ) {
empty = gensub(/[^,]/,"","g",tail)
}
file2[$1] = tail
next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
The above uses GNU awk for gensub(), with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.
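For reference, a sketch of that extra step for a POSIX awk (using sub()/gsub() in place of gensub(); untested):
BEGIN { FS=OFS="," }
NR==FNR {
    tail = $0
    sub(/[^,]*,/,"",tail)        # strip the key field and its trailing comma
    if ( FNR == 1 ) {
        empty = tail
        gsub(/[^,]/,"",empty)    # keep only the commas, i.e. the right number of empty fields
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }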
An interesting (to me at least!) alternative you might want to test for a performance difference is:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
tail = gensub(/[^,]*,/,"",1)
idx[$1] = NR
file2[NR] = tail
if ( FNR == 1 ) {
file2[""] = gensub(/[^,]/,"","g",tail)
}
next
}
{ print $0, file2[idx[$1]] }
$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
but I don't really expect it to be any faster and it MAY even be slower.
You can try:
awk 'BEGIN{FS=OFS=","}
FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
{print $0, ($1 in d?d[$1]:",")}' file2 file1
You get:
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
join to the rescue:
$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt
Explanation:
-t $',': Use , as the field separator.
-a 1: Do not discard records from file 1 if not present in file 2.
-e '': Replace missing fields with the empty string.
-o: Output format.
file1.txt
1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c
file2.txt
1,x,y
2,x,y
5,x,y
6,x,y
7,x,y
Output
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
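Note that join expects both inputs to be sorted on the join field; if file1.txt and file2.txt are not already sorted, a sketch using process substitution would be:
join -t ',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 <(sort -t ',' -k1,1 file1.txt) <(sort -t ',' -k1,1 file2.txt)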

Merging output of two awk scripts into 1 file

I have a large input file with 150+ columns and 50M rows, a sample of which is shown here:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1
I have a bash shell script:
function awkScript() {
awk -F, -v cols="$1" -v hdr="$2" '
BEGIN {OFS=FS}
NR==1 {n=split(cols,cn);
for(i=1;i<=NF;i++)
for(j=1;j<=n;j++)
if($i==cn[j]) c[++k]=i;
$(NF+1)=hdr}
NR >1 {v1=$c[1]; v2=$c[2]; v3=$c[3]
if(!v2 && !v3) $(NF+1) = v1?10:0
else $(NF+1) = v3?(v1-v3)/v3:0 + v2?(v1-v2)/v2:0}1' "$3"
}
function awkScript1() {
awk -F, -v cols="$1" -v hdr="$2" '
BEGIN {OFS=FS}
NR==1 {n=split(cols,cn);
for(i=1;i<=NF;i++)
for(j=1;j<=n;j++)
if($i==cn[j]) c[++k]=i;
$(NF+1)=hdr}
NR >1 {v1=$c[1]; v2=$c[2]; v3=$c[3]; v4=$c[4]
$(NF+1) = v1?(v1/(v1+v2+v3+v4)):0
}1' "$3"
}
function awkScriptWrapper() {
awkScript "$1" "$2"
}
function awkScriptWrapper1() {
awkScript1 "$1" "$2"
}
awkScript "c1,c2,c3" "Header1" "input.txt" | awkScriptWrapper "c4,c5,c6" "Header2" >> output.txt
awkScript1 "c7,c8,c9,c10" "Header3" "input.txt" | awkScriptWrapper1 "c11,c12,c13,c14" "Header4" >> output1.txt
Sample of output.txt is:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1
Sample of output1.txt is:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,1,0.5
My requirement is that I have to append Header1,Header2,Header3,Header4 to the end of the same input file, i.e., the above script should produce just one output file, "finaloutput.txt":
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1,1,0.5
I tried doing the following statements:
awkScript "c1,c2,c3" "Header1" "input.txt" | awkScriptWrapper "c4,c5,c6" "Header2" >> temp_output.txt
awkScript1 "c7,c8,c9,c10" "Header3" "temp_output.txt" | awkScriptWrapper1 "c11,c12,c13,c14" "Header4" >> finaloutput.txt
But I'm not getting the desired result.
Any help would be much appreciated.
Assuming that you need to join two commands in a pipeline:
$ cmd1 | join --header -j1 -t, -o1.{1..17} -o2.16,2.17 - <(cmd2)
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1,1,0.5
The above assumes that cmd1 outputs:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header1,Header2
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,-1
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,-1,-1
While cmd2 outputs:
id,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,Header3,Header4
1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0
2,0,0,1,0,0,1,1,0,0,0,1,0,0,1,1,0.5
How does it work?
--header will treat the first line in each file as a header line
-j1 will join on field one
-t, specifies , as the field delimiter
-o xxx will specify output columns; 1.1 means column one from file one, in this case cmd1, and 2.1 means column one from file two, in this case cmd2
-o1.{1..17} will expand to:
-o1.1 -o1.2 -o1.3 -o1.4 -o1.5 -o1.6 -o1.7 -o1.8 -o1.9 -o1.10 -o1.11 -o1.12 -o1.13 -o1.14 -o1.15 -o1.16 -o1.17
And is a quick way to specify the first 17 columns from cmd1.
- refers to standard input, which in this case is the output of cmd1
<(command) is a process substitution.
You can change to:
join [options] file1 file2
if you need to join two regular files.
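With the functions from the question, cmd1 and cmd2 would be the two existing pipelines, so a sketch of the full wiring (untested) is:
awkScript "c1,c2,c3" "Header1" "input.txt" | awkScriptWrapper "c4,c5,c6" "Header2" |
  join --header -j1 -t, -o1.{1..17} -o2.16,2.17 - \
    <(awkScript1 "c7,c8,c9,c10" "Header3" "input.txt" | awkScriptWrapper1 "c11,c12,c13,c14" "Header4") \
  > finaloutput.txt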

Comparing CSV files in Ubuntu

I have two CSV files and I need to check for creations, updates and deletions. Take the following example files:
ORIGINAL FILE
sku1,A
sku2,B
sku3,C
sku4,D
sku5,E
sku6,F
sku7,G
sku8,H
sku9,I
sku10,J
UPDATED FILE
sku1,A
sku2,B-UPDATED
sku3,C
sku5,E
sku6,F
sku7,G-UPDATED
sku11, CREATED
sku8,H
sku9,I
sku4,D-UPDATED
I am using the linux comm command as follows:
comm -23 --nocheck-order updated_file.csv original_file > diff_file.csv
Which gives me all newly created and updated rows as follows
sku2,B-UPDATED
sku7,G-UPDATED
sku11, CREATED
sku4,D-UPDATED
Which is great, but if you look closely, "sku10,J" has been deleted and I'm not sure of the best command/way to check for it. The data I have provided is merely a demo; the text "sku" does not exist in the real data, however column one of the CSV files is a unique 5-character identifier. Any advice is appreciated.
I'd use join instead:
join -t, -a1 -a2 -eMISSING -o 0,1.2,2.2 <(sort file.orig) <(sort file.update)
sku1,A,A
sku10,J,MISSING
sku11,MISSING, CREATED
sku2,B,B-UPDATED
sku3,C,C
sku4,D,D-UPDATED
sku5,E,E
sku6,F,F
sku7,G,G-UPDATED
sku8,H,H
sku9,I,I
Then I'd pipe that into awk
join ... | awk -F, -v OFS=, '
$3 == "MISSING" {print "deleted: " $1,$2; next}
$2 == "MISSING" {print "added: " $1,$3; next}
$2 != $3 {print "updated: " $0}
'
deleted: sku10,J
added: sku11, CREATED
updated: sku2,B,B-UPDATED
updated: sku4,D,D-UPDATED
updated: sku7,G,G-UPDATED
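If you would rather have the creations, updates and deletions land in separate files instead of prefixed lines, a small variation on that awk (a sketch; the output file names are just examples) could redirect inside awk:
join ... | awk -F, -v OFS=, '
$3 == "MISSING" {print $1,$2 > "deleted.csv"; next}
$2 == "MISSING" {print $1,$3 > "created.csv"; next}
$2 != $3 {print $1,$3 > "updated.csv"}
'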
This might be a really crude way of doing it, but if you are certain that the values in each file do not repeat, then:
cat file1.txt file2.txt | sort | uniq -u
If each file contains repeating strings, then you can sort|uniq them before concatenation.
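For example, a sketch of that per-file de-duplication using process substitution:
cat <(sort -u file1.txt) <(sort -u file2.txt) | sort | uniq -u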
