AWK: Compare two CSV files - shell

I have two CSV files and I want to compare them using AWK and generate a new file.
file1.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc123","C:/pro/xyz"
"abc124","C:/pro/in"
file2.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc125","C:/pro/xyz"
"abc126","C:/pro/in"
output.csv:
"file1","file2","Diff"
"abc121","abc121","Match"
"abc122","abc122","Match"
"abc123","","Unmatch"
"abc124","","Unmatch"
"","abc125","Unmatch"
"","abc126","Unmatch"

One way with awk:
script.awk:
BEGIN {
    FS = ","
}
# first file (NR==FNR): skip the header and remember each key
NR > 1 && NR == FNR {
    a[$1] = $2
    next
}
# second file, header skipped: print Match/Unmatch, dropping matched keys
FNR > 1 {
    print ($1 in a) ? $1 FS $1 FS "Match" : "\"\"" FS $1 FS "Unmatch"
    delete a[$1]
}
# whatever remains in a[] appeared only in the first file
END {
    for (x in a) {
        print x FS "\"\"" FS "Unmatch"
    }
}
Output:
$ awk -f script.awk file1.csv file2.csv
"abc121","abc121",Match
"abc122","abc122",Match
"","abc125",Unmatch
"","abc126",Unmatch
"abc124","",Unmatch
"abc123","",Unmatch

I didn't use awk alone, but if I understood the gist of what you're asking correctly, I think this long one-liner should do it...
join -t, -a 1 -a 2 -o 1.1 2.1 1.2 2.2 file1.csv file2.csv | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e '1d' -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'
Description:
The join portion takes the two CSV files, joins them on the first column (default behavior of join) and outputs all four fields (-o 1.1 2.1 1.2 2.2), making sure to include rows that are unmatched for both files (-a 1 -a 2).
The awk portion takes that output and collapses the 3rd and 4th columns into either "Match" or "Unmatch", depending on whether they do in fact match. I had to make an assumption about this behavior based on your example.
The sed portion deletes the "no","loc" header from the output (-e '1d') and replaces empty fields with open-close quote marks (-e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'). This last part might not be necessary for you.
EDIT:
As tripleee points out, the above fails if the two initial files are unsorted. Here's an updated command to fix that. It punts the header lines and sorts each file before passing them to join (note the <( ... ) process substitutions require a shell such as bash, ksh or zsh)...
join -t, -a 1 -a 2 -o 1.1 2.1 1.2 2.2 <( sed 1d file1.csv | sort ) <( sed 1d file2.csv | sort ) | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'

Related

Write specific columns of one file into another file: who can give me a more concise solution?

I have a troublesome problem writing specific columns of one file into another file. In more detail: I have file1 as below, and I need to write its first column, excluding the first row, into file2 as a single line with the values separated by the '|' sign. I have a partial solution using sed and awk, but it is missing the last step of inserting the result at the top of file2. Given how powerful awk, sed, etc. are, I still believe there should be a more concise solution. Who can offer me a more concise script?
sed '1d;s/ .//' ./file1 | awk '{printf "%s|", $1; }' | awk '{if (NR != 0) {print substr($1, 1, length($1) - 1)}}'
file1:
col_name data_type comment
aaa string null
bbb int null
ccc int null
file2:
xxx ccc(whatever is this)
The result of file2 should be this :
aaa|bbb|ccc
xxx ccc(whatever is this)
Assuming there's no whitespace in the column 1 data, in increasing length:
sed -i "1i$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')" file2
or
ed file2 <<END
1i
$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')
.
wq
END
or
{ awk 'NR > 1 {print $1}' file1 | paste -sd '|'; cat file2; } | sponge file2   # sponge is in moreutils
or
mapfile -t lines < <(tail -n +2 file1)
col1=( "${lines[@]%%[[:blank:]]*}" )
new=$(IFS='|'; echo "${col1[*]}"; cat file2)
echo "$new" > file2
This might work for you (GNU sed):
sed -z 's/[^\n]*\n//;s/\(\S*\).*/\1/mg;y/\n/|/;s/|$/\n/;r file2' file1
Process file1 "wholemeal" by using the -z command line option.
Remove the first line.
Remove all columns other than the first.
Replace newlines by |'s
Replace the last | by a newline.
Append file2.
Alternative using just command line utils:
tail +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
Tail file1 from line 2 onwards.
Using the results from the tail command, isolate the first column using a space as the column delimiter.
Using the results from the cut command, serialize each line into one, delimited by |'s.
Using the results from the paste, append file2 using the cat command.
I'm learning awk at the moment.
awk 'BEGIN{a=""} {if(NR>1) a = a $1 "|"} END{a=substr(a, 1, length(a)-1); print a}' file1
Edit: Here's another version that uses an array:
awk 'NR > 1 {a[++n]=$1} END{for(i=1; i<=n; ++i){if(i>1) printf("|"); printf("%s", a[i])} printf("\n")}' file1
Here is a simple Awk script to merge the files as per your spec.
awk '# From the first file, merge all lines except the first
NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
# We are in the second file; add a newline after data from first file
FNR == 1 { printf "\n" }
# Simply print all lines from file2
1' file1 file2
The NR==FNR condition is true while we are reading the first input file: the overall line number NR is equal to the line number within the current file, FNR. The final 1 is a common idiom for printing all input lines which make it this far into the script (the next in the first block prevents lines from the first file from reaching this far).
For conciseness, you can remove the comments.
awk 'NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
FNR == 1 { printf "\n" } 1' file1 file2
Generally speaking, Awk can do everything sed can do, so piping sed into Awk (or vice versa) is nearly always a useless use of sed.
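To illustrate with the pipeline from the question: the sed '1d' stage can simply become a condition in the Awk program, with identical intermediate output (both leave a trailing |):
# instead of:  sed '1d' file1 | awk '{ printf "%s|", $1 }'
awk 'NR > 1 { printf "%s|", $1 }' file1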

sed unterminated 's' command modify line of file

I'm trying to modify a groups.tsv file (I'm on repl.it so path to file is fine).
Each line in the file looks like this:
groupname \t amountofpeople \t lastadded
and I'm trying to count the occurrences of both a group name ($nomgrp) and a login ($login), and change lastadded to the login.
varcol2=$(grep "$nomgrp" groups | cut "-d " -f2- | awk -F"\t" '{print $2}' )
((varcol21=varcol2+1));
varcol3=$(awk -F"\t" '{print $3}' groups)
sed -i "s|${nomgrp}\t${varcol2}\t$varcol3|${nomgrp}\t${varcol21}\t${login}|" groups
However, I'm getting the error message:
sed: -e expression #1, char 27: unterminated `s' command
The groups file has lines such as "sudo 2 user1" (tab-delimited): a user inputs "user", which is stored in $login, then "sudo", which is stored in $nomgrp.
What am I doing wrong?
Sorry if this has been answered/super easy to fix, I'm quite the newbie here...
The immediate cause of the error is most likely that $varcol3 contains newlines: your awk prints field 3 of every line in the file, and an embedded newline terminates the s||| expression early. If I understand what you are trying to do correctly, and if you have GNU awk, you could instead do
gawk -i inplace -F '\t' -v group="$nomgrp" -v login="$login" -v OFS='\t' '$1 == group { $2 = $2 + 1; $3 = login; } { print }' groups.tsv
Example:
$ cat groups.tsv
wheel 1000 2019-12-10
staff 1234 2019-12-11
users 9001 2019-12-12
$ gawk -i inplace -F '\t' -v group=wheel -v login=2019-12-12 -v OFS='\t' '$1 == group { $2 = $2 + 1; $3 = login; } 1' groups.tsv
$ cat groups.tsv
wheel 1001 2019-12-12
staff 1234 2019-12-11
users 9001 2019-12-12
This works as follows:
-i inplace is a GNU awk extension that allows you to change a file in place,
-F '\t' sets the input field separator to a tab so that the input is interpreted as TSV and fields with spaces in them are not split apart,
-v name=value sets an awk variable for use in awk's code,
specifically, -v OFS='\t' sets the output field separator variable to a tab, so that the output is again a TSV
So we set the variables group and login to your shell variables and ensure that awk outputs a TSV. The code then works as follows:
$1 == group { # If the first field in a line is equal to the group variable
$2 = $2 + 1; # add 1 to the second field
$3 = login; # and overwrite the third with the login variable
}
{ # in all lines:
print # print
}
{ print } could also be abbreviated as 1, as someone will no doubt point out, but I find this way easier to explain.
If you do not have GNU awk, you could achieve the same with a temporary file, e.g.
awk -F '\t' -v group="$nomgrp" -v login="$login" -v OFS='\t' '$1 == group { $2 = $2 + 1; $3 = login; } { print }' groups.tsv > groups.tsv.new
mv groups.tsv.new groups.tsv

awk or shell command to count occurrences of a value in the 1st column based on values in the 4th column

I have a large file with records like below:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
I want to find the number of people (names in column 1) that have both apple and oranges. The command should use as little memory as possible and be fast. Any help appreciated!
Output :
awk/sed file => 2 (jon and tom)
Using awk is pretty easy:
awk -F, \
'$4 == "apple" { apple[$1]++ }
$4 == "oranges" { orange[$1]++ }
END { for (name in apple) if (orange[name]) print name }' data
It produces the required output on the sample data file:
jon
tom
Yes, you could squish all the code onto a single line, and shorten the names, and otherwise obfuscate the code.
Another way to do this avoids the END block:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && orange[$1]) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && apple[$1]) print $1 }' data
When it encounters an apple entry for the first time for a given name, it checks whether the name already has an entry for oranges and prints the name if so; likewise and symmetrically when it encounters an oranges entry for the first time for a given name.
As noted by Sundeep in a comment, it could use the in operator:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && $1 in orange) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple) print $1 }' data
The first answer could also use in in the END loop.
Note that all these solutions could be embedded in a script that would accept data from standard input (a pipe or a redirected file); they have no need to read the input file twice. You'd replace data with "$@" to process file names if they're given, or standard input if no file names are specified. This flexibility is worth preserving when possible.
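For instance, a minimal wrapper along those lines (the script name is hypothetical):
#!/bin/sh
# count-both.sh: print names that have both apple and oranges.
# Reads the files given as arguments, or standard input if none.
awk -F, '
    $4 == "apple"   { apple[$1]++ }
    $4 == "oranges" { orange[$1]++ }
    END { for (name in apple) if (name in orange) print name }
' "$@"
Then both ./count-both.sh data and cat data | ./count-both.sh work.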
With awk
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){print $1}' ip.txt ip.txt
jon
tom
This processes the input twice:
In the first pass, add a key to the array if the last field is apple (-F, sets , as the input field separator).
In the second pass, check whether the last field is oranges and the first field is a key of array a.
To print only number of matches:
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){c++} END{print c}' ip.txt ip.txt
2
Further reading: idiomatic awk, for details on two-file processing and awk idioms
I did a workaround using only the grep and comm commands.
grep "apple" file | cut -d"," -f1 | sort > file1
grep "orange" file | cut -d"," -f1 | sort > file2
comm -12 file1 file2 > "names.having.both.apple&orange"   # quote the filename: an unquoted & would background the command
comm -12 shows only the common names between the 2 files.
The solution from Jonathan also worked.
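A variant of the same idea without temporary files, assuming a shell with process substitution such as bash; anchoring the patterns also avoids accidental substring matches:
comm -12 <(grep ',apple$' file | cut -d, -f1 | sort -u) \
         <(grep ',oranges$' file | cut -d, -f1 | sort -u)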
For the input:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
the command:
sed -n "/apple\|oranges/p" inputfile | cut -d"," -f1 | uniq -d
will output a list of people with both apples and oranges:
jon
tom
Edit after comment: For an input file where lines are not ordered by the 1st column and where each person can have two or more repeated fruits, like:
jon,1,2,apple
fred,1,2,apple
fred,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
jon,1,2,oranges
tom,1,2,apple
mary,1,2,apple
tom,1,2,oranges
This command will work:
sed -n "/\(apple\|oranges\)$/ s/,.*,/,/p" inputfile | sort -u | cut -d, -f1 | uniq -d
Here sed keeps only the lines ending in apple or oranges and squeezes out the middle columns, leaving name,fruit pairs; sort -u drops repeated pairs, cut keeps the names, and uniq -d prints the names that now appear twice, i.e. those with both fruits.

Left outer join of multiple files data using awk command

I have a base file and multiple files sharing common data keyed on the 1st field of the base file. I need an output file combining all the data. I have tried many commands, but due to the file size they take too much time to produce output. awk often helps me out, but I don't have any experience with awk array programming.
Example:
Base File
aa
ab
ac
ad
ae
File -1
aa,Apple
ab,Orange
ac,Mango
File -2
aa,1
ab,2
ae,3
Output File expected
aa,Apple,1
ab,Orange,2
ac,Mango,
ad,,
ae,,3
This is what I tried:
awk -F, 'FNR==NR{a[$1]=$0;next}{if(b=a[$1]) print b,$2; else print $1 }' OFS=, test.txt test2.txt
You could try two successive joins. Something like the following command should work:
join -a 1 -t, -e '' -o auto <(join -a 1 -t, -e '' -o auto base_file file1) file2
Here, we first join base_file and file1, then join the result with file2.
Explanation:
join -a 1 -t, -e '' -o auto base_file file1:
-a 1: displays the lines of base_file even if there is no match in file1
-t,: treat the character , as the field separator. This affects both the input files and the output.
-e '' -o auto: when a field is not present, output the empty string ''. The -e option only takes effect together with -o; -o auto outputs all the fields, inferring how many from the first line of each file (a GNU extension).
Output:
aa,Apple,1
ab,Orange,2
ac,Mango,
ad,,
ae,,3
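Note that join expects its inputs sorted on the join field; the sample files here already are. If yours are not, sort them first, e.g. with process substitution (the intermediate join output stays sorted, so the outer join is still valid):
join -a 1 -t, -e '' -o auto \
    <(join -a 1 -t, -e '' -o auto <(sort base_file) <(sort file1)) \
    <(sort file2)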
awk way:
awk -F, -v OFS="," 'NR==FNR{a[$1]=$2} FILENAME==ARGV[2]{b[$1]=$2}
    FILENAME==ARGV[3]{print $0,a[$0],b[$0]}' f1 f2 base
The first file is recognised by NR==FNR, the second by comparing FILENAME against ARGV[2]; for each line of base (where the whole line is the key) it prints the line plus the values looked up in both arrays.
This will work in any awk for any number of input files:
$ cat tst.awk
BEGIN { FS=OFS="," }
!seen[$1]++ { keys[++numKeys] = $1 }   # remember each key once, in order of first appearance
FNR==1 { ++numFiles }                  # count files as each new file starts
{ a[$1,numFiles] = $2 }                # store the 2nd field per (key, file number)
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s%s", key, OFS
        for (fileNr=2; fileNr<=numFiles; fileNr++) {
            printf "%s%s", a[key,fileNr], (fileNr<numFiles ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk base file1 file2
aa,Apple,1
ab,Orange,2
ac,Mango,
ad,,
ae,,3

Add a column at any position in a file in unix [using awk or sed]

I'm looking for alternatives, or a more intelligent one-liner, to the following command, which adds a value at a requested column number.
The sed command below works properly for adding the value 4 as the 4th column.
[Need: I have files containing 1000 records, and I often need to add a column in between at an arbitrary position.]
My approach is suitable for small scale only.
cat 1.txt
1|2|3|5
1|2|3|5
1|2|3|5
1|2|3|5
sed -i 's/1|2|3|/1|2|3|4|/g' 1.txt
cat 1.txt
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
Thanks in advance.
Field Separators
http://www.gnu.org/software/gawk/manual/html_node/Field-Separators.html
String Concatenation
http://www.gnu.org/software/gawk/manual/html_node/Concatenation.html
Default pattern and action
http://www.gnu.org/software/gawk/manual/html_node/Very-Simple.html
awk -v FS='|' -v OFS='|' '{$3=$3"|"4} 1' 1.txt
Assigning $3 = $3 "|" 4 concatenates the old third field, a literal |, and 4; any field assignment makes awk rebuild $0 using OFS, and the final 1 prints each rebuilt record.
One way using awk. Pass two variables to the script: the column number and the value to insert. The script extends the record by one field (NF + 1) and walks from the last field down to the indicated position, shifting each field one place to the right, then inserts the new value there.
Run this command:
awk -v column=4 -v value="four" '
BEGIN {
    FS = OFS = "|";
}
{
    for ( i = NF + 1; i > column; i-- ) {
        $i = $(i-1);
    }
    $i = value;
    print $0;
}
' 1.txt
With following output:
1|2|3|four|5
1|2|3|four|5
1|2|3|four|5
1|2|3|four|5
One way using coreutils and process substitution:
f=1.txt
paste -d'|' \
<(cut -d'|' -f1-3 $f ) \
<(yes 4 | head -n`wc -l < $f`) \
<(cut -d'|' -f4- $f )
Or simply with sed, which, like the original approach, matches the literal data:
sed 's/3|/3|4|/' 1.txt
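Both sed commands match the literal text 3|, though; a position-based sketch that inserts after the third field regardless of the values (& is the matched text):
sed 's/^[^|]*|[^|]*|[^|]*|/&4|/' 1.txt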
