comparing CSV files in ubuntu - bash

I have two CSV files and I need to check for creations, updates and deletions. Take the following example files:
ORIGINAL FILE
sku1,A
sku2,B
sku3,C
sku4,D
sku5,E
sku6,F
sku7,G
sku8,H
sku9,I
sku10,J
UPDATED FILE
sku1,A
sku2,B-UPDATED
sku3,C
sku5,E
sku6,F
sku7,G-UPDATED
sku11, CREATED
sku8,H
sku9,I
sku4,D-UPDATED
I am using the linux comm command as follows:
comm -23 --nocheck-order updated_file.csv original_file.csv > diff_file.csv
This gives me all newly created and updated rows:
sku2,B-UPDATED
sku7,G-UPDATED
sku11, CREATED
sku4,D-UPDATED
Which is great, but if you look closely, "sku10,J" has been deleted, and I'm not sure of the best command/way to check for that. The data I have provided is merely a demo: the text "sku" does not exist in the real data, but column one of the CSV files is a unique 5-character identifier. Any advice is appreciated.

I'd use join instead:
join -t, -a1 -a2 -eMISSING -o 0,1.2,2.2 <(sort file.orig) <(sort file.update)
sku1,A,A
sku10,J,MISSING
sku11,MISSING, CREATED
sku2,B,B-UPDATED
sku3,C,C
sku4,D,D-UPDATED
sku5,E,E
sku6,F,F
sku7,G,G-UPDATED
sku8,H,H
sku9,I,I
Then I'd pipe that into awk
join ... | awk -F, -v OFS=, '
$3 == "MISSING" {print "deleted: " $1,$2; next}
$2 == "MISSING" {print "added: " $1,$3; next}
$2 != $3 {print "updated: " $0}
'
deleted: sku10,J
added: sku11, CREATED
updated: sku2,B,B-UPDATED
updated: sku4,D,D-UPDATED
updated: sku7,G,G-UPDATED

This might be a really crude way of doing it, but if you are certain that the values in each file do not repeat, then:
cat file1.txt file2.txt | sort | uniq -u
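On the demo data, this prints every row that changed in any way, old and new versions alike, without saying which file each came from (order shown per C collation):
sku10,J
sku11, CREATED
sku2,B
sku2,B-UPDATED
sku4,D
sku4,D-UPDATED
sku7,G
sku7,G-UPDATED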
If each file contains repeating strings, then you can sort|uniq them before concatenation.
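If you specifically want the deletions from comm, one more option (a sketch using the demo file names) is to compare just the sorted key columns, so the old versions of updated rows don't show up as false deletions:
comm -23 <(cut -d, -f1 original_file.csv | sort) <(cut -d, -f1 updated_file.csv | sort)
For the example data this prints just the deleted key:
sku10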

Filter records from one file based on values present in another file using Unix

I have an input CSV file:
Input feed
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
There is an output error CSV file, generated from this input file, which contains the primary key:
Error File
Pk,Error_Reason
D,Failure
E, Failure
F, Failure
I want to extract all the records from the input file for which there is a primary key entry in the error file, and save them into a new file.
Basically my new file should look like this:
New Input feed
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
I am a beginner in Unix and I have tried the awk command.
The approach I have tried is: first, get all the primary key values into a file.
awk -F"," '{print $1}' error.csv >> error_pk.csv
Now I need to filter out the records from input.csv for all the primary key values present in error_pk.csv.
Using awk. As there is leading space in the error file, it needs to be trimmed off first; I'm using sub for that. Then, since the titles of the first column are not identical (PK vs Pk), the header needs to be handled separately with FNR==1:
$ awk -F, ' # set separator
NR==FNR { # process the first file
sub(/^ */,"") # trim leading space
a[$1] # hash the first column
next
}
FNR==1 || ($1 in a)' error input # output the header record and lines whose key was hashed
Output:
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
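To save the result instead of printing it, condense the script to one line and redirect (new_feed.csv is just a placeholder name):
awk -F, 'NR==FNR{sub(/^ */,""); a[$1]; next} FNR==1 || ($1 in a)' error.csv input.csv > new_feed.csv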
You can use join:
First remove everything after the comma from the second file.
Then join on the first field from both files.
cat <<EOF >file1
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat <<EOF >file2
PK,Error_Reason
D,Failure
E,Failure
F,Failure
EOF
join -t, -11 -21 <(sort -k1 file1) <(cut -d, -f1 file2 | sort -k1)
If you need the file to be sorted according to file1, you can number the lines in first file, join the files, re-sort using the line numbers and then remove the numbers from the output:
join -t, -12 -21 <(nl -w1 -s, file1 | sort -t, -k2) <(cut -d, -f1 file2 | sort -k1) |
sort -t, -k2n | cut -d, -f1,3-
You can use grep -f with a file of search patterns, anchored and cut off at the comma:
grep -Ef <(sed -r 's/([^,]*).*/^\1,/' file2) file1
When you want a header in the output, one option is to print the first line of file1 separately before the filtered rows, for example (a sketch along the same lines):
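head -n1 file1                                       # print the header row once
grep -Ef <(sed -r 's/([^,]*).*/^\1,/' file2) file1   # then the matching rows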

bash: using 2 variables from same file and sed

I have 2 files:
file1.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T 45000079
rs111285978:45000103:A:AT 45000103
rs190363568:45000168:C:T 45000168
file2.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T rs142159069
rs111285978:45000103:A:AT rs111285978
rs190363568:45000168:C:T rs190363568
Using file2.txt, I want to replace the names (column 1 of file1.txt, which is also column 1 of file2.txt) with the entry in column 2 of file2.txt. The output file would then be:
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
I have tried reading in the columns of file2.txt, but without success:
while read -r a b
do
cat file1.txt | sed s'/$a/$b/'
done < file2.txt
I am quite new to bash. Also, I'm not sure how to write an output file with my command. Any help would be deeply appreciated.
In your case, using awk or perl would be easier, if you are willing to accept an answer without sed:
awk '(NR==FNR){out[$1]=$2;next}{out[$1]=out[$1]" "$2}END{for (i in out){print out[i]} }' file2.txt file1.txt > output.txt
output.txt :
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
Note: this assumes all symbols in column 1 are unique, and that they are all present in both files.
explanation:
(NR==FNR){out[$1]=$2;next} : while you are parsing the first file, create a map with the name from the first column as key
{out[$1]=out[$1]" "$2} : append the value from the second column
END{for (i in out){print out[i]} } : print all the values in the map
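For completeness, the original while/sed loop can also be made to work: the single quotes around the sed expression stop the shell from expanding $a and $b, so nothing is ever replaced. A sketch with double quotes instead (assuming the IDs contain no characters special to sed, and using GNU sed's -i):
cp file1.txt output.txt
while read -r a b
do
sed -i "s/$a/$b/" output.txt
done < file2.txt
Note this rescans the whole file once per line of file2.txt, so the awk approach above will be much faster on large files.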
Apparently $2 of file2 is part of $1 of file1, so you could use awk and redefine FS:
$ awk -F"[: ]" '{print $1,$NF}' file1
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168

How to merge multiple .csv files using the 1st column of one of them as an index (shell scripting)

How to merge multiple .csv files using the 1st column of one of them as an index (pref shell scripting - awk)
I have 88 .csv files that look like this.
Input file name: ZBND19X.csv
==> ZBND19X.csv <==
Gene,ZBND19X(26027342 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471
ENSTGUG00000000915,862.597795025
ENSTGUG00000006651 (ARPP19),845.045872644
ENSTGUG00000005054 (CAMKV),823.404021741
ENSTGUG00000005949 (FTH1),585.628487964
and ZBND39X.csv:
==> ZBND39X.csv <==
Gene,ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),971.678203888
ENSTGUG00000005054 (CAMKV),687.81249397
ENSTGUG00000006651 (ARPP19),634.296191033
ENSTGUG00000002582 (ITM2A),613.756010638
ENSTGUG00000000915,588.002298061
Output file name: RPKM_all.csv
Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000005949 (FTH1),585.628487964,0
ENSTGUG00000002582 (ITM2A),0,613.756010638
A 0 is added when there is no corresponding value found.
join can only work on two files at a time, here comes
awk to the rescue!
$ awk -F, 'FNR==1 {c++; h=h sep $2; sep=FS; next}
{ks[$1]; a[$1,c]=$2}
END {print h;
for(k in ks)
{printf "%s", k;
for(i=1;i<=c;i++) printf "%s", FS a[k,i]+0;
print ""}}' files
Disclaimer: this works only if the data fits in memory; also, the row order is lost, but if that's important there are ways to handle it.
Explanation: conceptually we create a table (aka 2D array, matrix) and fill in the entries. The rows are indexed by key and the columns by file number. Since the awk array hashes its keys, we treat the header separately so it stays in place. a[k,i]+0 converts missing elements to 0.
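With the real data, files at the end of the command is simply the list of all 88 CSVs, e.g. (assuming they share the ZBND prefix):
awk -F, '...' ZBND*.csv > RPKM_all.csv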
The simple answer is 'join'.
You can use the join command to match on the first column ( by default ) as long as the files are sorted.
Don't forget to sort your files.
Did I mention you need to sort your files ;)? It's an easy mistake to make (I've made that mistake plenty; hence the emphasis).
sort ZBND19X.csv > ZBND19X.csv.sorted
sort ZBND39X.csv > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv
Here's the contents of RPKM_all.csv after running above :
ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)
We can also look for rows that don't match like this:
$ join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}'
ENSTGUG00000005949 (FTH1),585.628487964,0
$ join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}'
ENSTGUG00000002582 (ITM2A),0,613.756010638
Now you can combine the whole thing:
sort ZBND19X.csv > ZBND19X.csv.sorted
sort ZBND39X.csv > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv
join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}' >> RPKM_all.csv
join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}' >> RPKM_all.csv
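To cover all 88 files rather than two, the same join can be applied pairwise in a loop. A sketch, assuming GNU join (for -o auto) and that all inputs share the ZBND prefix:
set -- ZBND*.csv
sort "$1" > RPKM_all.csv
shift
for f in "$@"
do
join -t, -a1 -a2 -e0 -o auto RPKM_all.csv <(sort "$f") > tmp.csv && mv tmp.csv RPKM_all.csv
done
Each pass pads missing values with 0 via -e0, and since join's output stays sorted by key it can feed the next iteration directly.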
Regarding the awk code (awk -F, 'FNR==1 {c++; h=h sep $2; sep=FS; next} ...):
Can anyone explain this in more detail? The code doesn't print the header correctly for me; all the headers jump to different rows, and the first header is missing too:
P21
P22
P24
P24
AamoA_EU022762 1 1 0 0
AamoA_EU099963 0 1 0 0

Joining two csv files in bash

I have to join two files by values in one column. I need to use unix bash.
My first file looks like this:
user_id, song_id, timestamp
00001638d6189236866af9bbf309ae6c2347ffdc,SOBBMDR12A8C13253B,1203083335
00001638d6189236866af9bbf309ae6c2347ffdc,SOBXALG12A8C13C108,984663773
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,SODDNQT12A6D4F5F7E,1275071044
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,SODDNQT12A6D4F5F7E,1097509573
Second file:
user_id, natural_key
00000b722001882066dff9d2da8a775658053ea0,6944471
00001638d6189236866af9bbf309ae6c2347ffdc,19309784
0000175652312d12576d9e6b84f600caa24c4715,10435505
00001cf0dce3fb22b0df0f3a1d9cd21e38385372,5232769
Of course both files have many more rows. I would like to join both files by first column (user_id) and get this result:
natural_key, song_id, timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573
I tried to do something with join and awk but to no avail. Could anyone help?
With GNU join, sed, sort and bash:
echo "natural_key, song_id, timestamp"
join -t, <(sed '1d' file1 |sort -t, -k1,1) <(sed '1d' file2 | sort -t, -k1,1) -o 2.2,1.2,1.3
Output:
natural_key, song_id, timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1097509573
5232769,SODDNQT12A6D4F5F7E,1275071044
This one is in GNU awk (regex FS). I'm just going to ignore that header spacing in your example:
$ awk 'BEGIN{FS=", ?";OFS=","}NR==FNR{a[$1]=$2;next}$1 in a{print a[$1],$2,$3}' file2 file1
natural_key,song_id,timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573
Explained:
$ awk '
BEGIN { FS=", ?"; OFS="," } # set the delimiters
NR==FNR { a[$1]=$2; next } # hash the first file in the parameter list
$1 in a { print a[$1], $2, $3 } # if key is found in hash, output
' file2 file1 # mind the order
Using the mlr util:
mlr --csvlite join -j user_id -f f1.csv \
then cut -o -f ' natural_key',' song_id',' timestamp' f2.csv
Output:
natural_key, song_id, timestamp
19309784,SOBBMDR12A8C13253B,1203083335
19309784,SOBXALG12A8C13C108,984663773
5232769,SODDNQT12A6D4F5F7E,1275071044
5232769,SODDNQT12A6D4F5F7E,1097509573
Note the leading spaces in the headers. These are left intact here because:
Most of the source data headers have leading spaces, but the data does not.
Leading spaces, if unquoted, will trip up most CSV-oriented utils.

Merge two files using awk and write the output

I have two files with a common field. I want to merge the two files on that common field and write the merged result to another file, using awk on the Linux command line.
file1
412234$name1$value1$mark1
413233$raja$$mark2
414444$$$
file2
412234$sum$file2$address$street
413233$sum2$file32$address2$street2$path
414444$$$$
These sample files are separated by $, and the merged output file should also use $. Note that some rows have empty fields.
I tried the script using join:
join -t "$" out2.csv out1.csv |sort -un > file3.csv
But the total number of rows didn't match.
Tried with awk:
myawk.awk
#!/usr/bin/awk -f
NR==FNR{a[FNR]=$0;next} {print a[FNR],$2,$3}
I ran it
awk -f myawk.awk out2.csv out1.csv > file3.csv
It was also taking too much time and seemed to hang.
Here out2.csv is the master file and we have to compare it with out1.csv.
Could you please help me write the merged result to another file?
Run the following using bash. This gives you the equivalent of a full outer join:
join -t'$' -a 1 -a 2 <(sort -k1,1 -t'$' out1.csv ) <(sort -k1,1 -t'$' out2.csv )
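If some keys appear in only one file and you want those rows padded to a uniform field count, GNU join can fill the missing fields (a sketch; -o auto is a GNU extension):
join -t'$' -a 1 -a 2 -e '' -o auto <(sort -k1,1 -t'$' out1.csv ) <(sort -k1,1 -t'$' out2.csv )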
You were headed in the right direction with the awk solution. The main point is to change FS so fields are split on $:
The script:
awk '
BEGIN {
    ## Split fields with "$".
    FS = "$"
}

## Save lines from the master file (out2.csv, read first): the first
## field is the array index, the rest of the line is the value.
FNR == NR {
    file2[ $1 ] = substr( $0, index( $0, "$" ) )
    next
}

## Print when keys from both files match.
FNR < NR {
    if ( $1 in file2 ) {
        printf "%s$%s\n", $0, file2[ $1 ]
    }
}
' out2.csv out1.csv
Output:
412234$name1$value1$mark1$$sum$file2$address$street
413233$raja$$mark2$$sum2$file32$address2$street2$path
414444$$$$$$$$
