compare 2 columns from 2 different csv files - shell

My intention is to compare a particular column of two different csv files and get the rows from the second file that are not in the first file. For example:
First File
"siddhartha",1
"mukherjee",2
Second file
"siddhartha",1
"mukherjee",2
"unique",3
Expected output
"unique",3
The command below works properly when the text in the first column is short, so in the above example it works:
awk -F',' 'FNR==NR{a[$1];next};!($1 in a);' file1.csv file2.csv > file3.csv
But if the text in the first column is quite large (for example 10000 characters), it does not work: it cuts the text off at a certain point.
Any solution for this?

Did you try the comm command?
something like this: comm -23 file2.csv file1.csv
Please read about it in man comm; both files must be sorted first. You can use sort to do that.
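For example, with bash process substitution (note that comm compares whole lines, not just the first column):
comm -23 <(sort file2.csv) <(sort file1.csv) > file3.csv
Here -23 suppresses the lines unique to file1.csv and the lines common to both, leaving only the lines that appear in file2.csv alone.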

Maybe the awk below:
awk 'BEGIN{FS=","};FNR==NR{a[$1];next};!($1 in a)' file1 file2
"unique",3
or
awk -F',' 'FNR==NR{a[$1];next};!($1 in a)' file1 file2
"unique",3
Set the field separator to comma and read each $1 value into a key.
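If your awk still cuts off very long fields, that limit is implementation-specific: some older awks cap the record or field length, while GNU awk does not. So, as a sketch assuming gawk is available:
gawk -F',' 'FNR==NR{a[$1];next};!($1 in a)' file1.csv file2.csv > file3.csv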

Related

Remove duplicate records from a csv file considering single column

I have a file with records of this type:
,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
3,22DE17,BA,S6CD6728,24JA13,6A
4,12FE18,AA,S6FD7688,25DA15,7D
I want to remove duplicate records based on the 4th column, which holds values like "S6CD6728", while skipping the first row, which is
",laac_repo,cntrylist,idlist,domlist,typelist"
I have tried
awk '{a[$4]++}!(a[$4]-1)' filename
And also tried
awk 'FNR > 1 {a[$4]++}!(a[$4]-1)' filename
The expected output is-
,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
P.S. The file has more than 10 million records, so please suggest a solution with that in mind (a script would be much appreciated, rather than a single command).
What about this:
awk -F, 'FNR>1 && !seen[$4]++' filename
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
Or, to keep the header line (its $4 value, idlist, occurs only once):
awk -F, '!seen[$4]++' filename
,laac_repo,cntrylist,idlist,domlist,typelist
1,22DE17,BA,S6CD6728,24JA13,6A
2,12FE18,AA,S6FD7688,25DA15,7D
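Regarding the more than 10 million records: awk makes a single streaming pass and keeps only one array entry per distinct column-4 value, so memory stays proportional to the number of unique IDs. If you prefer a script to a one-liner, a minimal sketch (the input file name is passed as an argument):
#!/bin/sh
# dedupe.sh: keep the first record for each distinct value of column 4.
# The header line survives because its column 4 ("idlist") occurs only once.
# Usage: ./dedupe.sh input.csv > output.csv
awk -F, '!seen[$4]++' "$1"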

How to use awk to extract columns from a file based on a key?

I have a file of keys, file1, (each key on a new line) that I need to use to extract certain columns from a second file, file2. File 1 is 46 lines long, while file2 is much larger, >20,000 lines long. Not all keys that appear in file1 appear in file2, and vice versa.
file1:
322510472
322510472
322510472
322510484
322510484
322510484
322510493
file2:
109287879,Invertebrate_iridescent_virus_3,109287879,148,1,148,NCVOG0391,0
109287880,Invertebrate_iridescent_virus_3,109287880,458,1,458,,
109287881,Invertebrate_iridescent_virus_3,109287881,156,1,156,,
109287882,Invertebrate_iridescent_virus_3,109287882,451,1,451,NCVOG1423,0
109287883,Invertebrate_iridescent_virus_3,109287883,217,1,217,NCVOG4910,2
109287884,Invertebrate_iridescent_virus_3,109287884,494,1,494,NCVOG0211,0
109287885,Invertebrate_iridescent_virus_3,109287885,447,1,447,NCVOG1077,0
109287886,Invertebrate_iridescent_virus_3,109287886,347,1,347,NCVOG0967,2
Both file1 and file2 are sorted by the key, which appears in columns 1 and 3 of file2.
I need to produce a third file, file3, which contains my keys from file1, as well as columns 2 and 7 from file2, and which does not omit any keys present in file1, even if there are no matching entries in file2.
I know that I have 46 entries in my file of keys, file1. However, when I use the following awk script,
awk -F"," 'NR==FNR {a[$1]=$1 FS $2 FS $7; next} $1 in a {print a[$1],$2,$7}' file2 file1
I only see 44 lines of output.
I need to not delete any of my keys in my awk output, as they correspond to actual data I need to keep in other files also containing those keys.
Any suggestions? Thanks for any and all help; I've been reading stack overflow for a while but this is my first time asking a question!
Your data doesn't have any matches, so I modified your keys to include matching entries:
$ join -t, -a1 -o1.1,2.2,2.7 file1 file2
109287879,Invertebrate_iridescent_virus_3,NCVOG0391
109287880,Invertebrate_iridescent_virus_3,
109287882,Invertebrate_iridescent_virus_3,NCVOG1423
109287884,Invertebrate_iridescent_virus_3,NCVOG0211
109287886,Invertebrate_iridescent_virus_3,NCVOG0967
322510472,,
322510472,,
322510472,,
Here is the file1 I used instead:
109287879
109287880
109287882
109287884
109287886
322510472
322510472
322510472
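If you would rather stay in awk, here is a minimal sketch of the same idea, reading file2 first so that every key from file1 is printed even when it has no match (empty fields stand in for the missing columns):
awk -F, -v OFS=, '
NR == FNR { a[$1] = $2 OFS $7; next }     # file2: map key -> columns 2 and 7
{ print $1, (($1 in a) ? a[$1] : OFS) }   # file1: emit empty fields when the key is absent
' file2 file1 > file3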

How do I merge 2 files if a different column in each file match and both files are in csv/double quote separated formats?

I've got 2 csv/double quote separated files. Column 26 in file 1 and column 2 in file 2 both contain domains and if I run the following
awk -F'"' '{print $26}' file1.csv
awk -F'"' '{print $2}' file2.txt
Then I can see that file 1 has 6 domains and file 2 has 3 domains.
All of the domains in file 2 are also in file 1.
I'd like to generate a new file containing all columns of file 1 plus all of the columns of file 2 where column 2 in file 2 matches column 26 in file 1.
Also, I'm pretty sure that column 26 is always the last column in file 1 but file 2 can have any number of columns.
Does anyone know how can I do this in bash, awk, sed or similar?
@Bruce: Try:
awk -F'"' 'FNR==NR{A[$26]=$0;next} ($2 in A){print A[$2] FS $0}' file1 file2
Here I am checking the FNR==NR condition, which is TRUE only while the first file, file1, is being read. I create an array named A whose index is field $26 and set its value to the current line; next then skips all further statements. While reading file2, I check whether its $2 is present in file1's array A, and if so print array A's value followed by the current line.
Kindly provide a sample Input_file and expected output in case the above doesn't meet your requirements.

Joining specific parts of text from two files in third?

My question is again about Linux shell programming. This time I have two text files, each with about 17,000 lines.
In the first file I have lines of this form:
[*] 11004, e01c5dee8efb188af91fb989a1039a12, isabelleann86#yahoo.com
And each line of the second file has this form:
e01c5dee8efb188af91fb989a1039a12:nathan09
Now I want to create a third file from these two, with this form:
isabelleann86#yahoo.com:nathan09
But note: the hash e01c5dee8efb188af91fb989a1039a12 must correspond to the same line in both the first and second file; I don't want to end up pairing email_1 with password_3421. I want the email from file one and the password from file two, joined where the lines share the same hash value.
I know it is probably possible using some grep/awk combination, but I just do not know how to put it together.
Here's one way using awk with multiple delimiters:
awk -F "[ ,:]+" 'FNR==NR { a[$3]=$4; next } $1 in a { print a[$1], $2 }' OFS=":" file1 file2 > file3
Results; contents of file3:
isabelleann86#yahoo.com:nathan09
Using awk
awk 'NR==FNR{a[$(NF-1)]=$NF;next}
$1"," in a {print a[$1","] FS $NF}' file1 FS=: file2

Bash script compare values from 2 files and print output values from one file

I have two files like this;
File1
114.4.21.198,cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
114.4.21.205,cl_id=1O3M7A7Q0S3C6h85902g7b3h7_101pf
114.4.21.205,cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
114.4.21.213,cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
File2
cl_id=1B3O7M6C8T4O1b559i2g930m0_1165d
cl_id=1X3J7M6J0W5S9535180h90302_101p5
cl_id=1G3D7X6V6A7R81356e3g527m9_101nl
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
cl_id=1Q3Y7Q7J0M3E62953e5g3g5k0_117p6
I want to find the cl_id values that exist in file1 but not in file2, and print out the first value from file1 (the IP address).
It should be like this:
114.4.21.198
114.4.21.205
114.4.21.205
114.4.21.213
114.4.23.70
114.4.21.201
114.4.21.211
120.172.168.36
I have tried awk, grep, diff, and comm, but nothing comes close. Please tell me the correct command to do this.
Thanks.
One proper way to do that is this:
grep -vFf file2 file1 | sed 's|,cl_id.*$||'
I do not see how you get your output. Where does 120.172.168.36 come from?
Here is one awk solution, reading file2 first and then printing the IP of each file1 row whose cl_id is not in file2:
awk -F, 'NR==FNR {a[$0]++;next} !a[$2] {print $1}' file2 file1
114.4.21.205
Feed both files into awk or perl with the field separator set to ",". If a line has two fields, add it to a dictionary ("file1Lines") keyed by the cl_id with the IP as the value. If it has just one field (so it came from file 2), add it to a set ("file2Lines"). After reading all input, loop over file1Lines: for each entry, check whether its key is present in file2Lines, and if not, print the value. A sketch of this is below.
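A minimal awk sketch of that description (array names are illustrative, and awk's for-in loop does not preserve input order):
awk -F, '
NF == 2 { file1Lines[$2] = $1; next }   # a file1 line: map cl_id -> IP
NF == 1 { file2Lines[$1] = 1 }          # a file2 line: record the cl_id
END {
    for (id in file1Lines)
        if (!(id in file2Lines))
            print file1Lines[id]        # IP whose cl_id never appeared in file2
}
' file1 file2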
This seems like what you want to do and should work efficiently:
grep -Ff file2.txt file1.txt | cut -f1 -d,
First the grep takes the lines from file2.txt to use as patterns, and finds the matching lines in file1.txt. The -F is to use the patterns as literal strings rather than regular expressions, though it doesn't really matter with your sample.
Finally the cut takes the first column from the output, using , as the column delimiter, resulting in a list of IP addresses.
The output is not exactly the same as your sample, but the sample didn't make sense anyway, as it contains text that was not in any of the input files. Not sure if this is what you wanted or something more.
