I have the following:
file1.csv
"Id","clientName1","clientName2"
file2.csv
"Id","Name1","Name2"
I want to read file1 sequentially. For each record, I want to check if there is a matching Id in file2.csv. There may be more than one match. For each match, I want to append Name1 and Name2 to the end of the corresponding record of file1.csv.
So, possible result, if a record has more than one match in file2:
"Id","clientName1","clientName2","Name1","Name2","Name1","Name2"
A regex solution using join and GNU sed:
join -t , -a 1 file[12].csv | sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
This assumes that both file1.csv and file2.csv are sorted by id and have no header.
file1.csv
1,c11,c12
2,c21,c22
3,c31,c32
file2.csv
1,n11,n12
1,n21,n22
1,n31,n32
2,n41,n42
gives a result of
1,c11,c12,n11,n12,n21,n22,n31,n32
2,c21,c22,n41,n42
3,c31,c32
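For reference, join -t , -a 1 file[12].csv alone emits one line per match (plus, via -a 1, the unpaired record from file1.csv):
1,c11,c12,n11,n12
1,c11,c12,n21,n22
1,c11,c12,n31,n32
2,c21,c22,n41,n42
3,c31,c32
The sed then folds consecutive lines that share a common comma-terminated prefix back into single records.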
UPDATE
In case file1.csv contains duplicate ids and rows with varying numbers of fields, I would suggest a pre-processing step to make sure file1.csv is clean before joining it with file2.csv:
awk -F, '{for(i=2;i<=NF;i++) print $1 FS $i}' file1.csv |\
sort -u |\
sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
The first awk command splits every row into (id, name) pairs; sort -u then sorts the pairs and removes duplicates; the final sed command merges all pairs sharing the same id back into a single row.
input
1,c11,c12
1,c12,c14,c13
1,c15,c12
2,c21,c22
output
1,c11,c12,c13,c14,c15
2,c21,c22
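To see why this works: the awk stage turns that input into one (id, name) pair per line, and sort -u leaves
1,c11
1,c12
1,c13
1,c14
1,c15
2,c21
2,c22
which the sed stage then folds back into one row per id, giving the output above.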
I'm afraid bash may not be the most efficient solution, but the following bash script should work:
#!/bin/bash
declare -A id_hash
while IFS= read -r line; do
    id=$(echo "$line" | cut -d ',' -f 1)
    name=$(echo "$line" | cut -d ',' -f 2-)
    if [ -z "${id_hash[$id]}" ]; then
        id_hash[$id]=$name
    else
        id_hash[$id]=${id_hash[$id]},$name
    fi
done < file1.csv
while IFS= read -r line; do
    id=$(echo "$line" | cut -d ',' -f 1)
    name=$(echo "$line" | cut -d ',' -f 2-)
    if [ -z "${id_hash[$id]}" ]; then
        id_hash[$id]=$name
    else
        id_hash[$id]=${id_hash[$id]},$name
    fi
done < file2.csv
for id in "${!id_hash[@]}"; do
    echo "$id,${id_hash[$id]}"
done
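Run against the same header-less sample files as in the join answer above, this prints the same three merged rows (though associative-array iteration order is not guaranteed):
1,c11,c12,n11,n12,n21,n22,n31,n32
2,c21,c22,n41,n42
3,c31,c32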
Thanks to all, but this has been completed. The code I wrote is below:
#!/bin/bash
echo
echo 'Merging files into one'
IFS=","
while read -r id lname fname dnaid status type program startdt enddt ref email dob age add1 add2 city postal phone1 phone2
do
    var="$dnaid,$lname,$fname,$status,$type,$program,$startdt,$enddt,$ref,$email,$dob,$age,$add1,$add2,$city,$postal,$phone1,$phone2"
    while read -r id2 cwlname cwfname
    do
        if [ "$id" = "$id2" ]
        then
            var="$var,$cwlname,$cwfname"
        fi
    done < file2.csv
    echo "$var" >> /root/scijoinedfile.csv
done < file1.csv
echo
echo "Merging completed"
In response to the OP's clarification in the comments, here is a revised version of the single awk command. It merges even when IDs are duplicated in file1, file2, or both, and when rows have different numbers of fields; the old version worked only for the question as originally stated.
awk -F',' '{one=$1;$1="";a[one]=a[one]$0} END{for (i in a) print i""a[i]}' OFS=, file[12]
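The same command spread out with comments (behavior unchanged):
awk -F',' '
    {
        one = $1            # remember the id
        $1 = ""             # blank the id; $0 is rebuilt as ",field2,field3,..." with OFS
        a[one] = a[one] $0  # accumulate this record's remaining fields under its id
    }
    END { for (i in a) print i a[i] }   # print each id followed by its collected fields
' OFS=, file[12]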
For the inputs:
file1
"Id1","clientN1","clientN2"
"Id2","Name3","Name4"
"Id3","client00","client01","client02"
"Id1","client1","client2","client3"
file2
"Id1","Name1","Name2"
"Id1","Name3","Name4"
"Id2","Name0","Name1"
"Id2","Name00","Name11","Name22"
The output is file1 and file2 merged on matching IDs:
"Id1","clientN1","clientN2","client1","client2","client3","Name1","Name2","Name3","Name4"
"Id2","Name3","Name4","Name0","Name1","Name00","Name11","Name22"
"Id3","client00","client01","client02"
Here is my sample data:
1,32425,New Zealand,number,21004
1,32425,New Zealand,number,20522
1,32434,Australia,number,1542
1,32434,Australia,number,986
1,32434,Fiji,number,1
Here is my expected output:
1,32425,New Zealand,number,21004,No
1,32425,New Zealand,number,20522,No
1,32434,Australia,number,1542,No
1,32434,Australia,number,986,No
1,32434,Fiji,number,1,Yes
Basically I am trying to append Yes/No based on whether field 3 is contained in an external file. Here is what I have currently, but as I understand it grep is eating all the stdin in the while loop, so I am only getting No added to the end of each line, since the first value is not contained in the external file.
while IFS=, read -r type id country number volume
do
    if grep $country externalfile.csv
    then
        echo "${country}"
        sed 's/$/,Yes/' >> file2.csv
    else
        echo "${country}"
        sed 's/$/,No/' >> file2.csv
    fi
done < file1.csv
I added the echo "${country}" as I was trying to troubleshoot and that's how I discovered it was only parsing the first line.
Assuming there are no headers -
awk -F, 'NR==FNR{lookup[$1]=$1; next;}
{ if ( lookup[$3] == $3 ) { print $0 ",Yes" } else { print $0 ",No" } }
' externalfile.csv file2.csv
This will parse both files in one pass.
If you just prefer to do it in pure bash,
declare -A lookup
while read -r c; do lookup["$c"]="$c"; done < externalfile.csv
declare -p lookup # this is just to show you what my example loaded
declare -A lookup='([USA]="USA" [Fiji]="Fiji" )'
while IFS=, read -r a b c d; do
    [[ -n "${lookup[$c]}" ]] && echo "$a,$b,$c,$d,Yes" || echo "$a,$b,$c,$d,No"
done < file2.csv
1,32425,New Zealand,number,21004,No
1,32425,New Zealand,number,20522,No
1,32434,Australia,number,1542,No
1,32434,Australia,number,986,No
1,32434,Fiji,number,1,Yes
No grep needed.
awk -F, -v OFS=, 'NR == FNR { ++a[$1]; next } { $(++NF) = $3 in a ? "Yes" : "No" } 1' externalfile.csv file2.csv
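Spread out with comments, that one-liner reads:
awk -F, -v OFS=, '
    NR == FNR { ++a[$1]; next }             # first file: record every country seen
    { $(++NF) = $3 in a ? "Yes" : "No" }    # data file: append Yes/No as a new last field
    1                                       # print the (now extended) record
' externalfile.csv file2.csv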
Try this:
while read -r line
do
    country=$(echo "$line" | cut -d',' -f3)
    if grep -q "$country" externalfile.csv
    then
        echo "$line,Yes" >> file2.csv
    else
        echo "$line,No" >> file2.csv
    fi
done < test.txt
You need to put $country inside double quotes, because some countries contain more than one word, for example New Zealand. You can also set the country variable more simply using the cut command.
I'm a scripting newbie and am looking for help building a BASH script that compares different columns in different CSV documents and then prints the non-matches. I've included an example below.
File 1
Employee ID Number,Last Name,First Name,Preferred Name,Email Address
File 2
Employee Name,Email Address
I want to compare the email address column in both files. If File 1 does not contain an email address found in File 2, I want to output that line to a new file.
Thanks in advance!
This is how I would do it:
#!/bin/bash
#
>output.txt
# Read file2.txt, line by line...
while read -r line2
do
    # Extract the email from the line
    email2=$(echo "$line2" | cut -d',' -f2)
    # Verify whether the email is in file1
    if [ "$(grep -c "$email2" file1.txt)" -eq 0 ]
    then
        # It is not, so output the line from file2 to the output file
        echo "$line2" >> output.txt
    fi
done < file2.txt
Email addresses are case insensitive, which should be a consideration when comparing them. This version uses a little awk to handle the case bit and to select the last field in each line ($NF):
#!/bin/bash
our_addresses=( $(awk -F, '{print tolower($NF)}' file1) )
while read -r line; do
    this_address=$(awk -F, '{print tolower($NF)}' <<< "$line")
    if [[ ! " ${our_addresses[@]} " =~ " $this_address " ]]; then
        echo "$line"
    fi
done < file2
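For larger files, the same case-insensitive comparison can be done in a single awk pass instead of one awk call per line; this is a sketch assuming, as above, that the email is the last comma-separated field in both files:
awk -F, 'NR == FNR { seen[tolower($NF)]; next }
         !(tolower($NF) in seen)' file1 file2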
Some of the data files being imported have header names on the first row and others don't. The ones with headers always have "company" as the first field of the first row. To load them into the DB I need to get rid of that first row, so I need to write a .sh script that deletes the first row only of those files whose first field on the first row is "company". I guess I need to combine awk with an if statement, but I don't know exactly how.
if head -n 1 input.csv | cut -f 1 -d ',' | grep -q company
then
    tail -n +2 input.csv > output.csv
else
    cp input.csv output.csv
fi
If you're sure the string "company" appears only as the 1st field of header rows, you can go this way, supposing the separator is a comma:
sed -e '/^company,/d' oldfile > newfile
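If in-place editing is acceptable, the test can also be restricted to line 1 so a data row starting with "company" is never deleted by accident (a sketch, assuming GNU sed and a hypothetical glob for the import files):
for f in import_*.csv; do
    sed -i '1{/^company,/d}' "$f"
done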
Another solution:
if head -1 oldfile | grep -q "^company," ; then
    sed -e 1d oldfile > newfile
else
    cp oldfile newfile
fi
No if needed. Just do it straightforwardly, as you stated your requirements:
strip_header_if_present() {
    # Print the first line unless it starts with "company"
    IFS='' read -r first_line
    echo "$first_line" | grep -v '^company,'
    # Now print the remaining lines
    cat
}
To use this shell function:
strip_header_if_present < input.csv > output.csv
I wanted to see the different types of answers I'd receive from you for the problem below. I am curious to see it solved entirely through arrays or any other matching technique (if there is one).
Below is the problem: keeping Name as the key, we need to print each name's phone numbers on a single line.
$cat input.txt
Name1, Phone1
Name2, Phone2
Name3, Phone1
Name4, Phone5
Name1, Phone2
Name2, Phone1
Name4, Phone1
O/P:
$cat output.txt
Name1,Phone1,Phone2
Name2,Phone2,Phone1
Name3,Phone1
Name4,Phone5,Phone1
I solved the above problem, but I wanted to see other solving techniques, perhaps one more effective than mine. I am not an expert in shell, still at a beginner level. My code is below:
$cat keyMatchingfunction.sh
while read -r LINE; do
    var1=$(echo "$LINE" | awk -F\, '{ print $1 }')
    matching_line=$(grep "$var1" output.txt | wc -l)
    if [[ $matching_line -eq 0 ]]; then
        echo "$LINE" >> output.txt
    else
        echo "$LINE is already present in output.txt"
        line_no=$(grep -n "$var1" output.txt | cut -d: -f1)
        keymatching=$(echo "$LINE" | awk -F\, '{ print $2 }')
        sed -i "$line_no s/$/,$keymatching/" output.txt
    fi
done
Try this:
awk -F', ' '{a[$1]=a[$1]","$2}END{for(i in a) print i a[i]}' input.txt
Output:
Name1,Phone1,Phone2
Name2,Phone2,Phone1
Name3,Phone1
Name4,Phone5,Phone1
With bash and sort:
#!/bin/bash
declare -A array # define associative array
# read file input.txt into the array
while IFS=", " read -r line number; do
    array["$line"]+=",$number"
done < input.txt
# print the array
for i in "${!array[@]}"; do
    echo "$i${array[$i]}"
done | sort
Output:
Name1,Phone1,Phone2
Name2,Phone2,Phone1
Name3,Phone1
Name4,Phone5,Phone1
In the current directory there are files with names of the form "gradesXXX" (where XXX is a course number) which look like this:
ID GRADE (this line is not contained in the files)
123456789 56
213495873 84
098342362 77
. .
. .
. .
I want to write a BASH script that prints all the IDs that have a grade above a certain number, which is given as the first parameter to said script.
The requirements are that an ID must be printed once at most, and that no intermediate files are used.
I was guided to use two scripts: the first one line long, and the second up to six lines long (not including the "#!" line).
I'm quite lost with this one so any suggestions will be appreciated.
Cheers.
The answer I was looking for was:
Internal script:
#!/bin/bash
while read -r line; do
    line_split=( $line )
    if (( ${line_split[1]} > $1 )); then
        echo "${line_split[0]}"
    fi
done
External script:
#!/bin/bash
cat grades* | sort -r -n -k 1 | ./internalScript "$1" | cut -f1 -d" " | uniq
OK, a simple solution.
cat grades[0-9][0-9][0-9] | sort -nrk 2 | while read -r ID GRADE ; do if [ "$GRADE" -lt 60 ] ; then break ; fi ; echo "$ID" ; done | sort -u
I'm not sure why two scripts should be necessary. All in a script:
#!/bin/bash
threshold=$1
cat grades[0-9][0-9][0-9] | sort -nrk 2 | while read -r ID GRADE ; do if [ "$GRADE" -lt "$threshold" ] ; then break ; fi ; echo "$ID" ; done | sort -u
We first cat all the grade files, then sort them by grade in reverse order. The while loop breaks once the grade drops below the threshold, so only lines with higher grades get their ID printed. sort -u makes sure that every ID is sent only once.
You can use awk:
awk '{ if ($2 > 70) print $1 }' grades777
It prints the first column of every line whose second column is greater than 70. If you need to change the threshold:
N=71
awk '{ if ($2 > '$N') print $1 }' grades777
The single quotes are required to splice shell variables into the awk program. To work with all grades??? files in the current directory and remove duplicated lines:
awk '{ if ($2 > '$N') print $1 }' grades??? | sort -u
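Alternatively, awk's -v option passes the shell variable in without the quote splicing:
N=71
awk -v n="$N" '$2 > n { print $1 }' grades??? | sort -u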
A simple one-line solution.
Yet another solution:
cat grades[0-9][0-9][0-9] | awk -v MAX=70 '{ if ($2 > MAX) foo[$1]=1 }END{for (id in foo) print id }'
Append | sort -n after that if you want the IDs in sorted order.
In pure bash:
N=60
for file in /path/*; do
    while read -r id grade; do ((grade > N)) && echo "$id"; done < "$file"
done
OUTPUT
213495873
098342362
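Since the assignment wants each ID printed at most once, the loop output can be piped through sort -u like the other answers do, in case an ID repeats across the grade files:
N=60
for file in /path/*; do
    while read -r id grade; do ((grade > N)) && echo "$id"; done < "$file"
done | sort -u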