How to merge multiple .csv files using the 1st column of one of them as an index (shell scripting) - shell

How to merge multiple .csv files using the 1st column of one of them as an index (preferably shell scripting / awk)
I have 88 .csv files that look like this.
Input file names like ZBND19X.csv:
==> ZBND19X.csv <==
Gene,ZBND19X(26027342 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471
ENSTGUG00000000915,862.597795025
ENSTGUG00000006651 (ARPP19),845.045872644
ENSTGUG00000005054 (CAMKV),823.404021741
ENSTGUG00000005949 (FTH1),585.628487964
and ZBND39X.csv:
==> ZBND39X.csv <==
Gene,ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),971.678203888
ENSTGUG00000005054 (CAMKV),687.81249397
ENSTGUG00000006651 (ARPP19),634.296191033
ENSTGUG00000002582 (ITM2A),613.756010638
ENSTGUG00000000915,588.002298061
Output file name RPKM_all.csv:
Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000005949 (FTH1),585.628487964,0
ENSTGUG00000002582 (ITM2A),0,613.756010638
A 0 should be added when no corresponding value is found.

join can only work on two files at a time; here comes awk to the rescue!
$ awk -F, 'FNR==1 {c++; if (c==1) h=$1; h=h FS $2; next}
       {ks[$1]; a[$1,c]=$2}
       END {print h
            for (k in ks) {
                printf "%s", k
                for (i=1; i<=c; i++) printf "%s", FS ((k,i) in a ? a[k,i] : 0)
                print ""}}' files
Disclaimer: this only works if the data fits in memory. The original row order is also lost, but if that matters there are ways to handle it (see the sketch below the explanation).
Explanation: conceptually we create a table (a 2D array, or matrix) and fill in the entries. The rows are indexed by key and the columns by file number. Since awk arrays hash their keys, the header is handled separately (seeded with the first file's $1, i.e. Gene) so it stays on the first line. The ternary ((k,i) in a ? a[k,i] : 0) fills in 0 for missing entries while leaving the values that do exist untouched.
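If the original row order matters (for (k in ks) visits keys in arbitrary order), one way to handle it is to remember the order in which keys are first seen and loop over that instead; a minimal sketch along the same lines:
$ awk -F, 'FNR==1 {c++; if (c==1) h=$1; h=h FS $2; next}
       !($1 in ks) {order[++n]=$1}
       {ks[$1]; a[$1,c]=$2}
       END {print h
            for (j=1; j<=n; j++) {
                k = order[j]
                printf "%s", k
                for (i=1; i<=c; i++) printf "%s", FS ((k,i) in a ? a[k,i] : 0)
                print ""}}' files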

The simple answer is 'join'.
You can use the join command to match on the first column (by default) as long as the files are sorted.
Don't forget to sort your files.
Did I mention you need to sort your files ;)? It's an easy mistake to make (I've made that mistake plenty; hence the emphasis).
sort ZBND19X.csv > ZBND19X.csv.sorted
sort ZBND39X.csv > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv
Here's the content of RPKM_all.csv after running the above:
ENSTGUG00000000915,862.597795025,588.002298061
ENSTGUG00000005054 (CAMKV),823.404021741,687.81249397
ENSTGUG00000006651 (ARPP19),845.045872644,634.296191033
ENSTGUG00000013338 (GAPDH),984.31862471,971.678203888
Gene,ZBND19X(26027342 pairs),ZBND39X(26558640 pairs)
We can also look for rows that don't match like this:
$ join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}'
ENSTGUG00000005949 (FTH1),585.628487964,0
$ join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}'
ENSTGUG00000002582 (ITM2A),0,613.756010638
Now you can combine the whole thing:
sort ZBND19X.csv > ZBND19X.csv.sorted
sort ZBND39X.csv > ZBND39X.csv.sorted
join -t, ZBND19X.csv.sorted ZBND39X.csv.sorted > RPKM_all.csv
join -v1 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,$2,0}' >> RPKM_all.csv
join -v2 -t, ZBND19X.csv.sorted ZBND39X.csv.sorted | awk -F, -v OFS=, '{print $1,0,$2}' >> RPKM_all.csv
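With 88 input files, the same pairwise join can be folded in a loop, joining the accumulated result with one new file at a time. A minimal sketch, assuming GNU join (its -o auto pads missing fields with the -e string) and that ZBND*.csv matches all of your input files:
out=RPKM_all.csv
rm -f "$out"
for f in ZBND*.csv; do
    sort "$f" > "$f.sorted"
    if [ ! -s "$out" ]; then
        cp "$f.sorted" "$out"
    else
        # -a 1 -a 2 keeps genes missing from either side, -e 0 fills the gaps;
        # the "Gene,..." header row joins on the key "Gene" like any other row
        join -t, -a 1 -a 2 -e 0 -o auto "$out" "$f.sorted" > "$out.tmp" &&
            mv "$out.tmp" "$out"
    fi
done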

Regarding the first line of the awk code (the FNR==1 block that builds the header): can anyone explain it in more detail? For me the code doesn't print the header correctly; all the headers jump to different rows and the first header is missing too. This is what I get:
P21
P22
P24
P24
AamoA_EU022762 1 1 0 0
AamoA_EU099963 0 1 0 0

Related

Sort and split CSV file using sed or awk

I have a CSV file (test.csv) that looks like this:
WH_01,TRAINAMS,A10,1221-ESD
WH_03,TRAINLON,L10A3,3005-21
WH_01,TRAINAMS,A101,PWR-120
WH_02,TRAINCLE,A1,074-HD-SATA
WH_01,TRAINAMS,A10,PWR-120
WH_02,TRAINCLE,A15,102-55665
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
1) I can sort the file based on the value in column #2 as follows:
sort -t, -k2,2 test.csv > testsort.csv
2) Next I would like to split the file based on the value in column #2. Using the above example, it should create 3 files:
testsort_1.csv:
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
testsort_2.csv:
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
testsort_3.csv:
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
How can I do this? I'm not sure if the sort is even required, or whether the above can be achieved without sorting.
Thank you.
Good move separating sort and awk.
$ sort -t, -k2,2 test.csv |awk -F, '!($2 in T) {T[$2]=++i} {print > ("testsort_" i ".csv")}'
$ tail -n +1 testsort*
==> testsort_1.csv <==
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
==> testsort_2.csv <==
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
==> testsort_3.csv <==
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
!($2 in T) - if the second field is not yet an index of array T,
{T[$2]=++i} - increment the counter and save the second field as an index.
{print > ...} - print every line, redirected to a file (awk truncates the file the first time it is opened during the run, then keeps appending to it).
("testsort_" i ".csv") - concatenate the strings and the variable to build the file name.
Since you're not sure whether you need to sort, that almost certainly means you don't and you just think it'd be useful for some reason. You're sorting on $2 and then splitting into different files based on the value of $2, so the sort is doing no good whatsoever.
All you actually need to do is:
awk -F, '{print > ($2".csv")}'
Look:
$ ls
test.csv
$ awk -F, '{print > ($2".csv")}' test.csv
$ ls
test.csv TRAINAMS.csv TRAINCLE.csv TRAINLON.csv
$ tail -n +1 TRAIN*
==> TRAINAMS.csv <==
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
==> TRAINCLE.csv <==
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
==> TRAINLON.csv <==
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
If you got past about 20 output file names and weren't using GNU awk then you'd have to close() each one whenever $2 changes and use >> instead of > to append to them.
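A sketch of that non-GNU-awk variant, assuming the input arrives grouped on $2 (which the sort guarantees) so each output file is opened and closed exactly once:
sort -t, -k2,2 test.csv |
awk -F, '
    $2 != prev { if (prev != "") close(prev ".csv"); prev = $2 }
    { print >> ($2 ".csv") }   # >> appends, so delete stale output files before re-running
'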
If for some reason you really do need to use the output file names from your question then that'd be:
awk -F, '!($2 in map){map[$2]="testsort_"++cnt".csv"} {print > map[$2]}' test.csv
You can do it in a fairly simple fashion by keeping a counter for the filename and using sprintf to create the filename for each successive group of records. You use FNR (the per-file record number) to distinguish between the first and subsequent records.
For example:
$ sort -t, -k2 file.csv |
awk -F, -v cnt=1 -v fn="testsort_1.csv" '
    FNR==1 {
        prev = $2
        print $0 > fn
    }
    FNR>1 {
        if ($2 != prev) {
            cnt++
            fn = sprintf("%s_%d.csv", "testsort", cnt)
        }
        print $0 > fn
        prev = $2
    }'
(Note: you set the initial filename as a variable to begin with, and then create all subsequent filenames from your cnt (count) using sprintf. prev tracks the second field from the previous record; fn is the filename built by sprintf from the counter.)
A shorter version of the same script, declaring prev as a variable initially, would be:
sort -t, -k2 file.csv |
awk -F, -v cnt=0 -v prev="" '{
    if ($2 != prev) {
        cnt++
        fn = "testsort_" cnt ".csv"
        prev = $2
    }
    print $0 > fn
}'
If you do not wish to have sequentially numbered files, but instead want the number in "testsort_number.csv" taken from the sorted records, look at @Cyrus's now-deleted answer, which provides an excellent (and shorter) solution in that regard. (I see you already have a great answer.)
Example Use/Output
With your input in file.csv, the following output files would be created:
$ for i in testsort_{1..3}.csv; do printf "\n%s\n" $i; cat $i; done
testsort_1.csv
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A10,PWR-120
WH_01,TRAINAMS,A101,PWR-120
testsort_2.csv
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
testsort_3.csv
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859

Merging 2 sorted files (with similar content) based on timestamp in shell script

I have 2 identically formatted files with the content below:
File1:
1,Abhi,Ban,20180921T09:09:01,EmpId1,SalaryX
4,Bbhi,Dan,20180922T09:09:03,EmpId2,SalaryY
7,Cbhi,Ean,20180923T09:09:05,EmpId3,SalaryZ
9,Dbhi,Fan,20180924T09:09:09,EmpId4,SalaryQ
File2:
11,Ebhi,Gan,20180922T09:09:02,EmpId5,SalaryA
12,Fbhi,Han,20180923T09:09:04,EmpId6,SalaryB
3,Gbhi,Ian,20180924T09:09:06,EmpId7,SalaryC
5,Hbhi,Jan,20180925T09:09:08,EmpId8,SalaryD
I want to merge all of File1's and File2's content, based on the date, in ascending order.
Outcome:
1,Abhi,Ban,20180921T09:09:01,EmpId1,SalaryX
11,Ebhi,Gan,20180922T09:09:02,EmpId5,SalaryA
4,Bbhi,Dan,20180922T09:09:03,EmpId2,SalaryY
12,Fbhi,Han,20180923T09:09:04,EmpId6,SalaryB
7,Cbhi,Ean,20180923T09:09:05,EmpId3,SalaryZ
3,Gbhi,Ian,20180924T09:09:06,EmpId7,SalaryC
9,Dbhi,Fan,20180924T09:09:09,EmpId4,SalaryQ
5,Hbhi,Jan,20180925T09:09:08,EmpId8,SalaryD
You can use the awk construct below to do this:
awk -F "," 'NR==FNR{print $4, $0;next} NR>FNR{print $4, $0;}' f1.txt f2.txt | sort | awk '{print $2}'
Explanation:
Prefix the date column ($4) before every line ($0), for both files.
Sort it, and then print $2, which is the whole original line (this works here because the data lines themselves contain no spaces).
These printed lines will be in sorted order by date.
f1.txt and f2.txt are the two file names.
You can try the following command
awk 'FNR==NR{a[FNR]=$0;next}{print a[FNR]"\n"$0}' file1 file2
It stores file1's lines in array a, with FNR as the key, and then prints each stored line followed by the corresponding line of file2. Note that this simply interleaves the two files line by line; it gives the desired order here only because the dates happen to alternate between the two files.
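Since the timestamps in field 4 sort correctly as plain strings (YYYYMMDDTHH:MM:SS), it is also worth noting that sort alone can do the merge:
sort -t, -k4,4 file1 file2
or, because both inputs are already sorted on that field, merge them without a full re-sort:
sort -m -t, -k4,4 file1 file2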

Vlookup using awk command

I have two files on my Linux server.
File 1
9190784
9197256
9170546
9184139
9196854
File 2
S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP
I want to write a script or a single-line bash command using awk so that each element in File 1 is matched against Column1 in File 2, printing Column1, Column2 and Column3. If an entry is not found, it should print the entry from File 1 with NA in Column2 and Column3.
Output: it should be redirected to a new file, as below.
new_file
9190784,TGM,AP
9197256,NA,NA
9170546,NA,NA
9184139,AGM,KN
9196854,TGM,AP
I hope the query is understandable. Can anyone please help me with this?
A standard join operation with awk:
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[$2]=$3 OFS $4; next}
{print $1, (($1 in a)?a[$1]:"NA" OFS "NA")}' file2 file1
substring variation (not tested)
$ awk 'BEGIN {FS=OFS=","}
NR==FNR {a[substr($2,1,7)]=$3 OFS $4; next}
{key=substr($1,1,7);
print $1, ((key in a)?a[key]:"NA" OFS "NA")}' file2 file1
Does it have to be awk? It can be done with join:
Having two files:
echo '9190784
9197256
9170546
9184139
9196854' >file2
echo 'S NO.,Column1,Column2,Column3
72070,9196854,TGM,AP
72071,9172071,BGM,MP
72072,9184139,AGM,KN
72073,9172073,TGM,AP' > file1
One can join on , as the separator, matching the second field of the first file (-12) against the first field of the second file (-21); the first file has its header line removed (tail -n +2) and is sorted on the second field (sort -t, -k2), and the second file is sorted with plain sort.
join -t, -12 -21 -o1.2,1.3,1.4 <(tail -n +2 file1 | sort -t, -k2) <(sort file2)
will output:
9184139,AGM,KN
9196854,TGM,AP
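To also get the NA rows the question asks for (IDs from file2 that have no match), join's -a, -e and -o options can be combined with the same command; a sketch (note the rows come out sorted by ID rather than in file2's original order):
join -t, -1 2 -2 1 -a 2 -e NA -o 0,1.3,1.4 <(tail -n +2 file1 | sort -t, -k2) <(sort file2) > new_file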

comparing CSV files in ubuntu

I have two CSV files and I need to check for creations, updates and deletions. Take the following example files:
ORIGINAL FILE
sku1,A
sku2,B
sku3,C
sku4,D
sku5,E
sku6,F
sku7,G
sku8,H
sku9,I
sku10,J
UPDATED FILE
sku1,A
sku2,B-UPDATED
sku3,C
sku5,E
sku6,F
sku7,G-UPDATED
sku11, CREATED
sku8,H
sku9,I
sku4,D-UPDATED
I am using the linux comm command as follows:
comm -23 --nocheck-order updated_file.csv original_file > diff_file.csv
which gives me all newly created and updated rows, as follows:
sku2,B-UPDATED
sku7,G-UPDATED
sku11, CREATED
sku4,D-UPDATED
This is great, but if you look closely, "sku10,J" has been deleted and I'm not sure of the best command/way to check for it. The data I have provided is merely a demo; the text "sku" does not exist in the real data, but column one of the CSV files is a unique 5-character identifier. Any advice is appreciated.
I'd use join instead:
join -t, -a1 -a2 -eMISSING -o 0,1.2,2.2 <(sort file.orig) <(sort file.update)
sku1,A,A
sku10,J,MISSING
sku11,MISSING, CREATED
sku2,B,B-UPDATED
sku3,C,C
sku4,D,D-UPDATED
sku5,E,E
sku6,F,F
sku7,G,G-UPDATED
sku8,H,H
sku9,I,I
Then I'd pipe that into awk
join ... | awk -F, -v OFS=, '
$3 == "MISSING" {print "deleted: " $1,$2; next}
$2 == "MISSING" {print "added: " $1,$3; next}
$2 != $3 {print "updated: " $0}
'
deleted: sku10,J
added: sku11, CREATED
updated: sku2,B,B-UPDATED
updated: sku4,D,D-UPDATED
updated: sku7,G,G-UPDATED
This might be a really crude way of doing it, but if you are certain that the values in each file do not repeat, then:
cat file1.txt file2.txt | sort | uniq -u
If each file contains repeated lines, you can sort | uniq each one before concatenating.
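If the remaining gap is specifically the deleted rows (keys present in the original file but gone from the updated one, like sku10), a small awk sketch keyed on the first column, writing to an example file name deleted.csv:
awk -F, 'NR==FNR {seen[$1]; next} !($1 in seen)' updated_file.csv original_file > deleted.csv
With the demo data this leaves just sku10,J in deleted.csv.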

using AWK to join certain lines of two files and then sort the merged file

I use the following AWK command to sort the contents after the first 24 lines of a file:
awk 'NR <= 24; NR > 24 {print $0 | "sort -k3,3 -k4,4n"}' file > newFile
Now I want to join two files first (for now, simply discarding the first 24 lines of both files) and then sort the merged result. Is there any way to do it without generating a temporary merged file?
awk 'FNR > 24' file1 file2 | sort -k3,3 -k4,4n > newFile
FNR is the file record number (resets to 1 for the first line of each file). If you insist on having the sort inside the awk script, you can use:
awk 'FNR > 24 { print $0 | "sort -k3,3 -k4,4n" }' file1 file2 > newFile
but I prefer the shell to do my piping.
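If you also want to keep the first file's 24 header lines in the output (as the original single-file command did), one simple way that still avoids a temporary merged file is to let the shell group the two steps:
{ head -n 24 file1; awk 'FNR > 24' file1 file2 | sort -k3,3 -k4,4n; } > newFile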
