Comparing Columns in Different CSV then Print Non-Matches - bash

I'm a scripting newbie and am looking for help in building a BASH script to compare different columns in different CSV documents and then print the non-matches. I've included an example below.
File 1
Employee ID Number,Last Name,First Name,Preferred Name,Email Address
File 2
Employee Name,Email Address
I want to compare the email address column in both files. If File 1 does not contain an email address found in File 2, I want to output that line from File 2 to a new file.
Thanks in advance!

This is how I would do it:
#!/bin/bash
#
>output.txt
# Read file2.txt, line by line...
while read -r line2
do
    # Extract the email from the line (second comma-separated field)
    email2=$(echo "$line2" | cut -d',' -f2)
    # Check whether the email appears anywhere in file1
    if [ "$(grep -c -- "$email2" file1.txt)" -eq 0 ]
    then
        # It is not there, so append the line from file2 to the output file
        echo "$line2" >>output.txt
    fi
done < file2.txt

Email addresses are case-insensitive, which should be taken into account when comparing them. This version uses a little awk to handle the case conversion and to select the last field on each line ($NF):
#!/bin/bash
our_addresses=( $(awk -F, '{print tolower($NF)}' file1) )

while read -r line; do
    this_address=$(awk -F, '{print tolower($NF)}' <<< "$line")
    if [[ ! " ${our_addresses[@]} " =~ " $this_address " ]]; then
        echo "$line"
    fi
done < file2
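
If the files are large, both loops above run several external commands per line of file2. The same comparison can be done in a single awk pass; here is a rough sketch, assuming the email address is the last comma-separated field in both files and that no field contains an embedded comma:

# Build a set of lowercased emails from file1, then print every file2 line
# whose email is not in that set.
awk -F',' '
    NR == FNR { seen[tolower($NF)]; next }   # first file: remember its emails
    !(tolower($NF) in seen)                  # second file: print non-matches
' file1 file2 > output.txt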

Related

Merge two csv files if Id columns match

I have the following:
file1.csv
"Id","clientName1","clientName2"
file2.csv
"Id","Name1","Name2"
I want to read file1 sequentially. For each record, I want to check if there is a matching Id in file2. There may be more than one match. For each match, I want to append Name1, Name2 to the end of the record of file1.csv
So, possible result, if a record has more than one match in file2:
"Id","clientName1","clientName2","Name1","Name2","Name1","Name2"
A regex solution using join and GNU sed:
join -t , -a 1 file[12].csv | sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
This assumes that both file1.csv and file2.csv are sorted by Id and have no header line (-a 1 keeps the lines of file1 that have no match in file2):
file1.csv
1,c11,c12
2,c21,c22
3,c31,c32
file2.csv
1,n11,n12
1,n21,n22
1,n31,n32
2,n41,n42
gives a result of
1,c11,c12,n11,n12,n21,n22,n31,n32
2,c21,c22,n41,n42
3,c31,c32
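
The sed part is dense, so here it is spelled out with comment lines (GNU sed accepts comments inside a script); this is only an annotated rendering of the one-liner above, not a different program:

join -t , -a 1 file1.csv file2.csv | sed -r '
    # append the next input line to the pattern space (except on the last line)
    $!N
    # if the two lines do not share the same comma-terminated "id,..." prefix, print the first one
    /^(.*,)(.*)\n\1/!P
    # otherwise merge them: keep the shared prefix once and re-join the rest with a comma
    s//\n\1\2,/
    # delete up to the embedded newline and restart the cycle with what remains
    D
'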
UPDATE
In case file1.csv might contain duplicate Ids and a varying number of fields, I would suggest a pre-processing step to make sure file1.csv is clean before joining it with file2.csv:
awk -F, '{for(i=2;i<=NF;i++) print $1 FS $i}' file1.csv |\
sort -u |\
sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'
the first awk process splits all data into (id, name) pairs
sort -u sorts the pairs and removes duplicates
the final sed process merges all pairs with the same Id back into a single row
input
1,c11,c12
1,c12,c14,c13
1,c15,c12
2,c21,c22
output
1,c11,c12,c13,c14,c15
2,c21,c22
I'm afraid bash may not be the most efficient solution, but the following bash script would work:
#!/bin/bash
declare -A id_hash
# collect fields from file1.csv, keyed by Id
while read -r line; do
    id=$(echo "$line" | cut -d ',' -f 1)
    name=$(echo "$line" | cut -d ',' -f 2-)
    if [ -z "${id_hash[$id]}" ]; then
        id_hash[$id]=$name
    else
        id_hash[$id]=${id_hash[$id]},$name
    fi
done < file1.csv
# append fields from file2.csv to the matching Ids
while read -r line; do
    id=$(echo "$line" | cut -d ',' -f 1)
    name=$(echo "$line" | cut -d ',' -f 2-)
    if [ -z "${id_hash[$id]}" ]; then
        id_hash[$id]=$name
    else
        id_hash[$id]=${id_hash[$id]},$name
    fi
done < file2.csv
for id in "${!id_hash[@]}"; do
    echo "$id,${id_hash[$id]}"
done
Thanks to all but this has been completed. The code I wrote is below:
#!/bin/bash
echo
echo 'Merging files into one'
IFS=","
while read id lname fname dnaid status type program startdt enddt ref email dob age add1 add2 city postal phone1 phone2
do
    var="$dnaid,$lname,$fname,$status,$type,$program,$startdt,$enddt,$ref,$email,$dob,$age,$add1,$add2,$city,$postal,$phone1,$phone2"
    while read id2 cwlname cwfname
    do
        if [ "$id" == "$id2" ]
        then
            var="$var,$cwlname,$cwfname"
        fi
    done < file2.csv
    echo "$var" >> /root/scijoinedfile.csv
done < file1.csv
echo
echo "Merging completed"
In response to the OP's clarification in the comments, here is the revised version of the single awk command. It performs the merge even when there are duplicate IDs in file1, in file2, or in both, and when the records have different numbers of fields. (The older, simpler version worked for the question as originally stated.)
awk -F',' '{one=$1;$1="";a[one]=a[one]$0} END{for (i in a) print i""a[i]}' OFS=, file[12]
For the inputs:
file1
"Id1","clientN1","clientN2"
"Id2","Name3","Name4"
"Id3","client00","client01","client02"
"Id1","client1","client2","client3"
file2
"Id1","Name1","Name2"
"Id1","Name3","Name4"
"Id2","Name0","Name1"
"Id2","Name00","Name11","Name22"
The output is merged file1 and file2 on same IDs:
"Id1","clientN1","clientN2","client1","client2","client3","Name1","Name2","Name3","Name4"
"Id2","Name3","Name4","Name0","Name1","Name00","Name11","Name22"
"Id3","client00","client01","client02"

Read multiple variables from file

I need to read a file that has lines like
user=username1
pass=password1
How can I read multiple lines like this into separate variables like username and password?
Would I use awk or grep? I have found ways to read lines into variables with grep but would I need to read the file for each individual item?
The end result is to use these variables to access a database via the command line. So I need to be able to read, store and use these values in other commands.
If the process which generates the file is trusted and the file uses shell syntax, just source it:
. ./file
Otherwise the file can be processed first to add safe quoting:
perl -ne 'if (/^([A-Za-z_]\w*)=(.*)/) {$k=$1;$v=$2;$v=~s/\x27/\x27\\\x27\x27/g;print "$k=\x27$v\x27\n";}' <file >file2
. ./file2
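
A hypothetical round trip showing what the filter produces (the creds / creds.sh file names are only placeholders):

printf '%s\n' "pass=pa's\$word" > creds
perl -ne 'if (/^([A-Za-z_]\w*)=(.*)/) {$k=$1;$v=$2;$v=~s/\x27/\x27\\\x27\x27/g;print "$k=\x27$v\x27\n";}' <creds >creds.sh
cat creds.sh     # pass='pa'\''s$word'
. ./creds.sh
echo "$pass"     # pa's$word -- the quote survives and $word is not expanded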
If you want to use awk:
Input
$ cat file
user=username1
pass=password1
Reading
$ user=$(awk -F= '$1=="user"{print $2;exit}' file)
$ pass=$(awk -F= '$1=="pass"{print $2;exit}' file)
Output
$ echo $user
username1
$ echo $pass
password1
You could use a loop for your file perhaps, but this is probably the functionality you're looking for.
$ echo 'user=username1' | awk -F= '{print $2}'
username1
Using the -F flag sets the delimiter to = and we select the 2nd item from the row.
file.txt:
user=username1
pass=password1
user=username2
pass=password2
user=username3
pass=password3
To avoid reading the file file.txt several times:
#!/usr/bin/env bash
func () {
    echo "user:$1 pass:$2"
}
i=0
while IFS='' read -r line; do
    if [ "$i" -eq 0 ]; then
        i=1
        user=$(echo "${line}" | cut -f2 -d'=')
    else
        i=0
        pass=$(echo "${line}" | cut -f2 -d'=')
        func "$user" "$pass"
    fi
done < file.txt
Output:
user:username1 pass:password1
user:username2 pass:password2
user:username3 pass:password3
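
Another pure-bash option, in case the user/pass lines do not strictly alternate; this is only a sketch, and it keeps the last value seen for each key:

#!/usr/bin/env bash
# Split each line on the first '=' and keep only the keys we care about
while IFS='=' read -r key value; do
    case "$key" in
        user) user=$value ;;
        pass) pass=$value ;;
    esac
done < file.txt
echo "user:$user pass:$pass"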

Bash script read specific value from files of an entire folder

I have a problem creating a script that reads specific values from all the files in a folder.
I have a number of email files in a directory and I need to extract 2 specific values from each file.
After that I have to put them into a new file that looks like this:
--------------
To: value1
value2
--------------
This is what I want to do, but I don't know how to create the script:
# I am putting the name of the files into a temp file
`ls -l | awk '{print $9 }' >tmpfile`
# use for the name of a file
`date=`date +"%T"
# The first specific value from file (phone number)
var1=`cat tmpfile | grep "To: 0" | awk '{print $2 }' | cut -b -10 `
# The second specific value from file(subject)
var2=cat file | grep Subject | awk '{print $2$3$4$5$6$7$8$9$10 }'
# Put the first value in a new file on the first row
echo "To: 4"$var1"" > sms-$date
# Put the second value in the same file on the second row
echo ""$var2"" >>sms-$date
.......
and do the same for every file in the directory
I tried using while and for functions but I couldn't finalize the script
Thank You
I've made a few changes to your script, hopefully they will be useful to you:
#!/bin/bash
for file in *; do
    var1=$(awk '/To: 0/ {print substr($2,1,10)}' "$file")
    var2=$(awk '/Subject/ {for (i=2; i<=10; ++i) s=s$i; print s}' "$file")
    base="sms-"$(date +"%T")
    outfile=$base
    i=0
    while [ -f "$outfile" ]; do outfile="$base-"$((i++)); done
    echo "To: 4$var1" > "$outfile"
    echo "$var2" >> "$outfile"
done
The for loop just goes through every file in the folder that you run the script from.
I have added an additional suffix $i to the end of the file name. If no file with the same date already exists, then the file will be created without the suffix. Otherwise the value of $i will keep increasing until there is no file with the same name.
I'm using $( ) rather than backticks, this is just a personal preference but it can be clearer in my opinion, especially when there are other quotes about.
There's not usually any need to pipe the output of grep to awk. You can do the search in awk using the / / syntax.
I have removed the cut -b -10 and replaced it with substr($2, 1, 10), which prints the first 10 characters of column 2.
It's not much shorter, but I used a loop rather than writing out $2$3...; I think it looks a bit neater.
There's no need for all the extra " in the two output lines.
I suggest trying the following:
#!/bin/sh
RESULT_FILE=sms-`date +"%T"`
DIR=.
fgrep -l 'To: 0' "$DIR"/* | while read FILE; do
    var1=`fgrep 'To: 0' "$FILE" | awk '{print $2 }' | cut -b -10`
    var2=`fgrep 'Subject' "$FILE" | awk '{print $2$3$4$5$6$7$8$9$10 }'`
    echo "To: 4$var1" >>"$RESULT_FILE"
    echo "$var2" >>"$RESULT_FILE"
done

Count multiple occurrences of a word on the same line using grep

Here I made a small script that takes input from the user, searches for a pattern in a file, and displays the required number of lines from that file wherever the pattern is found. This code searches for the pattern line-wise, which is standard grep behaviour. What I mean is: if the pattern occurs twice on the same line, I want the output to print twice. Hope that makes sense.
#!/bin/sh
cat /dev/null>copy.txt
echo "Please enter the sentence you want to search:"
read "inputVar"
echo "Please enter the name of the file in which you want to search:"
read "inputFileName"
echo "Please enter the number of lines you want to copy:"
read "inputLineNumber"
[[-z "$inputLineNumber"]] || inputLineNumber=20
cat /dev/null > copy.txt
for N in `grep -n $inputVar $inputFileName | cut -d ":" -f1`
do
LIMIT=`expr $N + $inputLineNumber`
sed -n $N,${LIMIT}p $inputFileName >> copy.txt
echo "-----------------------" >> copy.txt
done
cat copy.txt
As I understood it, the task is to count the number of pattern occurrences per line. It can be done like so:
count=$((`echo "$line" | sed -e "s|$pattern|\n|g" | wc -l` - 1))
Suppose you have one file to read. Then the code would be the following:
#!/bin/bash
file=$1
pattern="an."
# reading file line by line
cat -n $file | while read input
do
    # storing line to $tmp
    tmp=`echo $input | grep "$pattern"`
    # counting occurrences count
    count=$((`echo "$tmp" | sed -e "s|$pattern|\n|g" | wc -l` - 1))
    # printing $tmp line $count times
    for i in `seq 1 $count`
    do
        echo $tmp
    done
done
I checked this for pattern "an." and input:
I pass here an example of many 'an' letters
an
ananas
an-an-as
Output is:
$ ./test.sh input
1 I pass here an example of many 'an' letters
1 I pass here an example of many 'an' letters
1 I pass here an example of many 'an' letters
3 ananas
4 an-an-as
4 an-an-as
Adapt this to your needs.
How about using awk?
Assume the pattern you are searching for is in variable $pattern and the file you are checking is $file
Then, for the whole file:
count=`awk 'BEGIN{n=0}{n+=split($0,a,"'$pattern'")-1}END {print n}' $file`
or for a single line:
count=`echo $line | awk '{n=split($0,a,"'$pattern'")-1;print n}'`
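
Since the question title mentions grep: grep -o prints each match on its own line, so piping to wc -l gives a count as well. A small sketch:

# occurrences of $pattern in a single line
count=$(echo "$line" | grep -o -- "$pattern" | wc -l)
# occurrences of $pattern in the whole file
total=$(grep -o -- "$pattern" "$file" | wc -l)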

Shell script: count the copies of each line from a txt

I would like to count the copies of each line in a txt file. I have tried so many things until now, but none worked well. In my case the text has just one word on each line.
This was my last try
echo -n 'enter file for edit: '
read file
for line in $file ; do
echo 'grep -w $line $file'
done; <$file
For example:
input file
a
a
a
c
c
Output file
a 3
c 2
Thanks in advance.
$ sort < $file | uniq -c | awk '{print $2 " " $1}'
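
A possible awk-only alternative, if you do not mind that the output order is unspecified:

awk '{count[$0]++} END {for (word in count) print word, count[word]}' "$file"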
