Compare csv files based on column value - bash

I have two large csv files:
File1.csv
id,name,code
1,dummy,0
2,micheal,3
5,abc,4
File2.csv
id,name,code
2,micheal,4
5,abc,4
1,cd,0
I want to compare the two files based on id, and if any of the other columns are mismatched, output those rows (as they appear in File2).
For example, for id 1 the name is different and for id 2 the code is different, so the output should be:
output
1,cd,0
2,micheal,4
And yes, both files will have the same ids, though possibly in a different order.
I want to write a script that gives me the above output.

If you need the records in File2 that are not paired with File1, you can use Miller and this simple command:
mlr --csv join --np --ur -j id,name,code -f File1.csv File2.csv >./out.csv
In out.csv you will have
id,name,code
2,micheal,4
1,cd,0

awk -F, 'NR==FNR && FNR!=1 { map[$0]=1;next } FNR!=1 { if ( !map[$0] ) { print } }' File1.csv File2.csv
Set the field separator to comma. For the first file (NR==FNR), skipping its header (FNR!=1), record each whole line as a key in the array map. Then, for the second file, print any non-header line that has no entry in map.
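If you would rather key the comparison on the id column explicitly, here is a minimal sketch of the same idea (assuming, as stated, that both files contain the same ids):
awk -F, 'NR==FNR && FNR!=1 { row[$1]=$0; next }
         FNR!=1 && $0 != row[$1] { print }' File1.csv File2.csv
For the sample files this prints 2,micheal,4 and then 1,cd,0, i.e. the File2 version of every row whose name or code differs.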

The tool of choice for finding differences between files is, of course, diff. Here, it doesn't really matter if these files are comma-separated or in some other format because you're really only interested in lines that differ.
Knowing that both files contain the same IDs makes this quite easy, although the fact that they will not necessarily be in the same order requires sorting them both first.
In your example, you want as output the lines from File2, so running the diff output through grep for ^> will give you those.
Finally, let's get rid of the two additional characters that diff inserts at the beginning of each output line, using cut:
diff <(sort File1.csv) <(sort File2.csv) | grep '^>' | cut -c3-

Related

Merge unsorted lines from two files based on similar part

I am wondering whether it is possible to merge information from two files based on a shared part. file1 contains sequence IDs (from a BLAST run), and file2 contains taxonomic names corresponding to the first two numbers in each sequence name.
file1:
>301-89_IDNAGNDJ_171582
>301-88_ALPEKDJF_119660
>301-88_ALPEKDJF_112039
...
file2:
301-89--sample1
301-88--sample2
...
output:
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
The files are unsorted, and file1 contains multiple lines whose first two numbers match the first two numbers of a single line in file2. I am looking for some tips on how to do this. Is it possible, and which command or language should I use?
awk '
FNR == 1  { FS = (FNR == NR ? "--" : "[>_]"); $0 = $0 }  # pick the separator for each file and re-split
FNR == NR { map[$1] = "--" $NF; next }                   # file2: map["301-89"] = "--sample1"
          { print $0 map[$2] }                           # file1: after splitting on [>_], $2 is "301-89"
' file2.txt file1.txt
———————————————————————————
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
Using awk
$ awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' file2 OFS="--" file1
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
I am wondering whether it is possible to merge information from two files based on a shared part
Yes ...
The files are unsorted
... but only if they're sorted.
It's easier if we transform them so the delimiters are consistent, and then format it back together later:
sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 produces
301-88 ALPEKDJF_112039
301-88 ALPEKDJF_119660
301-89 IDNAGNDJ_171582
...
which we can just pipe through sort -k1
sed 's/--/ /' f2 produces
301-89 sample1
301-88 sample2
...
which we can sort the same way
join sorted1 sorted2 (with the sorted results of the previous steps) produces
301-88 ALPEKDJF_112039 sample2
301-88 ALPEKDJF_119660 sample2
301-89 IDNAGNDJ_171582 sample1
...
and finally we can format those 3 fields as you originally wanted (including the leading >), by piping through
sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
If it's reasonable to sort them on the fly, we can just do that using process substitution:
$ join \
<( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' f1 | sort -k1 ) \
<( sed 's/--/ /' f2 | sort -k1 ) \
| sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
>301-89_IDNAGNDJ_171582--sample1
...
If it's not reasonable to sort the files - on the fly or otherwise - you're going to end up building a hash in memory, like the awk answer is doing. Give them both a try and see which is faster.
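For example, a rough way to try both on your real data, reusing the commands above and assuming the file names f1 and f2 from the final commands (output is discarded so only the run time is measured):
time join \
  <( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' f1 | sort -k1 ) \
  <( sed 's/--/ /' f2 | sort -k1 ) \
  | sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/' > /dev/null

time awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' f2 OFS="--" f1 > /dev/null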

Filter records from one file based on values present in another file using Unix

I have an input csv file:
Input feed
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
There is an error csv file, generated from this input file, which has the primary key:
Error File
Pk,Error_Reason
D,Failure
E, Failure
F, Failure
I want to extract all the records from the input file for which there is a primary key entry in the error file, and save them into a new file.
Basically my new file should look like this:
New Input feed
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
I am a beginner in Unix and I have tried the awk command.
The approach I have tried is to get all the primary key values into a file:
awk -F"," '{print $1}' error.csv >> error_pk.csv
Now I need to filter out the records from input.csv for all the primary key values present in error_pk.csv.
Using awk. As there is a leading space in the error file, it needs to be trimmed off first; I'm using sub for that. Then, since the titles of the first column are not identical (PK vs Pk), that needs to be handled separately with FNR==1:
$ awk -F, '              # set the separator
NR==FNR {                # process the first file
    sub(/^ */,"")        # trim leading space
    a[$1]                # hash the first column
    next
}
FNR==1 || ($1 in a)' error input # output the header record and lines whose key is hashed
Output:
PK,Col1,Col2,Col3,Col4,Col5
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
You can use join.
First remove everything after the comma from the second file.
Then join on the first field of both files.
cat <<EOF >file1
PK,Col1,Col2,Col3,Col4,Col5
A,1,2,3,4,5
B,1,A,B,C,D
C,1,2,3,4
D,2,1,2,3
E,5,1,1,1
F,8,1,1,1
EOF
cat <<EOF >file2
PK,Error_Reason
D,Failure
E,Failure
F,Failure
EOF
join -t, -11 -21 <(sort -k1 file1) <(cut -d, -f1 file2 | sort -k1)
If you need the output to be sorted according to file1, you can number the lines in the first file, join the files, re-sort using the line numbers, and then remove the numbers from the output:
join -t, -12 -21 <(nl -w1 -s, file1 | sort -t, -k2) <(cut -d, -f1 file2 | sort -k1) |
sort -t, -k2,2n | cut -d, -f1,3-
You can use grep -f with a file of search items, each cut off at the ,:
grep -Ef <(sed -r 's/([^,]*).*/^\1,/' file2) file1
When you want a header in the output, you can print it explicitly before the filtered lines; see the sketch below.
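A minimal sketch, assuming the file names from the heredocs above (tail -n +2 builds the pattern list from file2's data rows only, so the header comes solely from the head command):
{ head -n 1 file1; grep -Ef <(tail -n +2 file2 | sed -r 's/([^,]*).*/^\1,/') file1; }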

Counting the number of names in a category in a .csv with bash

I would like to count the number of students in a .csv file depending on the category.
Category 1 is the name, category 2 is the country, category 3 is the city.
The .csv file is displayed as such :
michael_s;jpa;NYC
john_d;chn;TXS
jim_h;usa;POP
I have tried this in my .sh script but it didn't work:
sort -k3 -t; students.csv
edit:
I am trying to make a bash script that counts students by city, and that can also count a single city when it is passed as an argument, e.g.
cat students.csv | ./script.sh NYC
so that the terminal only displays the result for NYC.
If I've understood you correctly, something like this?
cut -d";" -f3 mike.txt | sort | uniq -c
(Sorry, incorrect solution first time - updated now)
To count only one city:
cut -d";" -f3 mike.txt | grep "NYC" | wc -l
Depending on the size of the file, how often you'll be doing this, etc., it may be sensible to look at other solutions, e.g. awk. But this solution will work just fine.
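If you want the exact interface from your edit (cat students.csv | ./script.sh NYC), here is a minimal sketch of such a script, assuming the ;-separated layout above: with no argument it counts students per city, with an argument it counts only that city.
#!/usr/bin/env bash
# usage: cat students.csv | ./script.sh        -> one count per city
#        cat students.csv | ./script.sh NYC    -> count for NYC only
if [ $# -eq 0 ]; then
    cut -d';' -f3 | sort | uniq -c
else
    cut -d';' -f3 | grep -cx "$1"
fi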
The reason for the error message "sort: multi-character tab 'students.csv'" is that you haven't given the -t option a separator character. If you add a semicolon after -t, the sort will work as expected:
sort -k3 -t';' students.csv
There is always awk:
$ awk -F\; 'a[$1]++==0{c++}END{print c}' file
3
Once you describe your requirements more thoroughly (you say you want to count the names, but you sort on -k3; please update the OP), we can help you better.
Edited to match your update:
$ awk -F\; -v col=3 -v val=NYC '
(length(val) && $col==val) || length(val)==0 && a[$col]++==0 {
c++
}
END { print c }
' file
1
If you set -v val= to the value you are looking for and -v col= to the column number, it counts the occurrences of val in col. If you set col but not val, it counts the distinct values in col.
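For example, leaving val empty with the same program counts the distinct cities in the sample file:
$ awk -F\; -v col=3 -v val= '(length(val) && $col==val) || length(val)==0 && a[$col]++==0 {c++} END{print c}' file
3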

match awk column value to a column in another file

I need to know if I can match an awk value against another file while I am inside a piped command, like below:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
From here I need to check whether the computed value $4*10^10+$6 is present in (matches) a column value of another file. If it is present, print the row; otherwise just move on.
File where value needs to be matched is as below:
a,b,c,d,e
1,2,30000000000,3,4
I need to match with the 3rd column of the above file.
I would ideally like this to be in the same command, because if this check is not applied, it prints more than 100 million rows (and produces a large file).
I have already read this question.
Adding more info:
Breaking my command into parts
part1-command:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "Something:"
part1-output (showing just one iteration's output):
Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463
part2-command: Now I use awk
awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
part2-output: currently the values below are printed (see how I computed 1*10^10+10588429 and got 10010588429):
1,10588429,10010588429,1491539456372358463
3,12394810,30012394810,1491539456372359082
1,10588430,10010588430,1491539456372366413
Now here I need to put a check (within the command [near awk]) to print only if 10010588429 was present in another file (say another_file.csv as below)
another_file.csv
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50
output should only be
1,10588429,10010588429,1491539456372358463
1,10588430,10010588430,1491539456372366413
So for every row produced by awk, we need to check for an entry in column C of another_file.csv.
Using the associative array approach from the question you cited, use a hyphen in place of the first file to direct awk to read the piped input stream.
Example:
grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]'
'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}
NR==FNR {
query[$4*10^10+$6]=$4*10^10+$6;
out[$4*10^10+$6]=$4 FS $6 FS $4*10^10+$6 FS $8;
next
}
query[$3]==$3 {
print out[$3]
}' - another_file.csv > output.csv
More info on the merging process in the answer cited in the question:
Using AWK to Process Input from Multiple Files
I'll post a template which you can utilize for your computation
awk 'BEGIN {FS=OFS=","}
NR==FNR {lookup[$3]; next}
/sometext/ {c=4}
c&&c--&&/somemoretext/ {value= # implement your computation here
if(value in lookup)
print "what you want"}' lookup.file FS=':' grep.files...
Here awk loads the values in the third column of the first file (which is comma delimited) into the lookup array (a hashmap in disguise). For the next set of files, it sets the delimiter to : and, similar to grep -A3, looks within 3 lines of the first pattern for the second pattern, does the computation, and prints what you want.
In awk you can have more control over which column your pattern matches as well; here I replicated the grep example.
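For instance, a sketch of the same windowed match with the patterns anchored to specific fields instead of the whole record (the field numbers here are only an illustration, not taken from your data):
awk -F'[:|]' '$1 ~ /sometext/ { c = 4 }                  # open a 4-line window when field 1 matches
              c && c-- && $3 ~ /somemoretext/ { print }  # within that window, require field 3 to match
' file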
This is another simplified example to focus on the core of the problem.
awk 'BEGIN{for(i=1;i<=1000;i++) print int(rand()*1000), rand()}' |
awk 'NR==FNR{lookup[$1]; next}
$1 in lookup' perfect.numbers -
The first process creates 1000 random records, and the second one filters the ones where the first field is in the lookup table.
28 0.736027
496 0.968379
496 0.404218
496 0.151907
28 0.0421234
28 0.731929
for the lookup file
$ head perfect.numbers
6
28
496
8128
The piped data is read as the second file, at the -.
You can pipe your grep or awk output into a while read loop which gives you some degree of freedom. There you could decide on whether to forward a line:
grep -A3 "sometext" | grep "somemoretext" | while read LINE; do
COMPUTED=$(echo $LINE | awk -F '[:|]' 'BEGIN{OFS=","}{print $4,$6,$4*10^10+$6,$8}')
if grep $COMPUTED /the/file/to/search &>/dev/null; then
echo $LINE
fi
done | cat -

Match and merge lines based on the first column

I have 2 files:
File1
123:dataset1:dataset932
534940023023:dataset:dataset039302
49930:dataset9203:dataset2003
File2
49930:399402:3949304:293000232:30203993
123:49030:1204:9300:293920
534940023023:49993029:3949203:49293904:29399
and I would like to create
Desired result:
49930:399402:3949304:293000232:30203993:dataset9203:dataset2003
534940023023:49993029:3949203:49293904:29399:dataset:dataset039302
etc
where the result contains one line for each pair of input lines that have an identical first column (with : as the column separator).
The join command is your friend here. You'll likely need to sort the inputs (either pre-sort the files, or use a process substitution if available - e.g. with bash).
Something like:
join -t ':' <(sort file2) <(sort file1) >file3
When you do not want to sort the files, you can play with grep:
while IFS=: read -r key others; do
    echo "${key}:${others}:$(grep "^${key}:" file1 | cut -d: -f2-)"
done < file2
