Compare 2 csv files and delete rows - Shell - bash

I have a 2 csv files. One has several columns, the other is just one column with domains. Simplified data of these files would be
file1.csv:
John,example.org,MyCompany,Australia
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
file2.csv:
example.org
google.es
mysite.uk
The output should be
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
I have tried this solution
grep -v -f file2.csv file1.csv >output-file
Found here
http://www.unix.com/shell-programming-and-scripting/177207-removing-duplicate-records-comparing-2-csv-files.html
But since there is no explanation whatsoever about how the script works, and I suck at shell, I cannot tweak it to make it work for me
A solution for this would be highly appreciated, a solution with some explanation would be awesome! :)
EDIT:
I have tried the line that was suppose to work, but for some reason it does not. Here the output from my terminal. What's wrong with this?
Desktop $ cat file1.csv ; echo
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Desktop $ cat file2.csv ; echo
example.org
google.es
mysite.uk
Desktop $ grep -v -f file2.csv file1.csv
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Why grep doesn't remove the line
John,example.org,MyCompany,Australia

The line you posted, works just fine.
$ grep -v -f file2.csv file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
And here's an explanation. grep will search for a given pattern in a given file and print all lines that match. The simplest example of usage is:
$ grep John file1.csv
John,example.org,MyCompany,Australia
Here we used a simple pattern that matches each character, but you can also use regular expressions (basic, extended, and even perl-compatible ones).
To invert the logic, and print only the lines that do not match, we use the -v switch, like this:
$ grep -v John file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
To specify more than one pattern, you can use the option -e pattern multiple times, like this:
$ grep -v -e John -e Lenny file1.csv
Martha,site.com,ThirdCompany,US
However, if there is a larger number of patterns to check for, we might use the -f file option that will read all patterns from a file specified.
So, when we combine all of those; reading patterns from a file with -f and inverting the matching logic with -v, we get the line you need.

One in awk:
$ awk -F, 'NR==FNR{a[$1];next}($2 in a==0)' file2 file1
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
Explained:
$ awk -F, ' # using awk, comma-separated records
NR==FNR { # process the first file, file2
a[$1] # hash the domain to a
next # proceed to next record
}
($2 in a==0) # process file1, if domain in $2 not in a, print the record
' file2 file1 # file order is important

Related

Extracting lines from 2 files using AWK just return the last match

Im a bit new using AWK and im trying to print lines in a file1 that a specific field exists in a file2. I copied exactly examples that I found here but i dont know why its just printing only the last match of the file1.
File1
58000
72518
94850
File2
58000;123;abc
69982;456;rty
94000;576;ryt
94850;234;wer
84850;576;cvb
72518;345;ert
Result Expected
58000;123;abc
94850;234;wer
72518;345;ert
What Im getting
94850;234;wer
awk -F';' 'NR==FNR{a[$1]++; next} $1 in a' file1 file2
What im doing wrong?
awk (while usable here), isn't the correct tool for the job. grep with the -f option is. The -f file option will read the patterns from file one per-line and search the input file for matches.
So in your case you want:
$ grep -f file1 file2
58000;123;abc
94850;234;wer
72518;345;ert
(note: I removed the trailing '\' from the data file, replace it if it wasn't a typo)
Using awk
If you did want to rewrite what grep is doing using awk, that is fairly simple. Just read the contents of file1 into an array and then for processing records from the second file, just check if field-1 is in the array, if so, print the record (default action), e.g.
$ awk -F';' 'FNR==NR {a[$1]=1; next} $1 in a' file1 file2
58000;123;abc
94850;234;wer
72518;345;ert
(same note about the trailing slash)
Thanks #RavinderSingh13!
The file1 really had some hidden characters and I could see it using cat.
$ cat -v file1
58000^M
72518^M
94850^M
I removed using sed -e "s/\r//g" file1 and the AWK worked perfectly.

AWK remove blank lines and append empty columns to all csv files in the directory

Hi I am looking for a way to combine all the below commands together.
Remove blank lines in the csv file (comma delimited)
Add multiple empty columns to each line up to 100th column
Perform action 1 & 2 on all the files in the folder
I am still learning and this is the best I could get:
awk '!/^[[:space:]]*$/' x.csv > tmp && mv tmp x.csv
awk -F"," '($100="")1' OFS="," x.csv > tmp && mv tmp x.csv
They work out individually but I don't know how how to put them together and I am looking for ways to have it run through all the files under the directory.
Looking for concrete AWK code or shell script calling AWK.
Thank you!
An example input would be:
a,b,c
x,y,z
Expected output would be:
a,b,c,,,,,,,,,,
x,y,z,,,,,,,,,,
you can combine in one script without any loops
$ awk 'BEGIN{FS=OFS=","} FNR==1{close(f); f=FILENAME".updated"} NF{$100=""; print > f}' files...
it won't overwrite the original files.
You can pipe the output of the first to the other:
awk '!/^[[:space:]]*$/' x.csv | awk -F"," '($100="")1' OFS="," > new_x.csv
If you wanted to run the above on all the files in your directory, you would do:
shopt -s nullglob
for f in yourdirectory/*.csv; do
awk '!/^[[:space:]]*$/' "${f}" | awk -F"," '($100="")1' OFS="," > new_"${f}"
done
The shopt -s nullglob is so that an empty directory won't give you a literal *. Quoted from a good source for about looping through files
With recent enough GNU awk you could:
$ gawk -i inplace 'BEGIN{FS=OFS=","}/\S/{NF=100;$1=$1;print}' *
Explained:
$ gawk -i inplace ' # using GNU awk and in-place file editing
BEGIN {
FS=OFS="," # set delimiters to a comma
}
/\S/ { # gawk specific regex operator that matches any character that is not a space
NF=100 # set the field count to 100 which truncates fields above it
$1=$1 # edit the first field to rebuild the record to actually get the extra commas
print # output records
}' *
Some test data (the first empty record is empty, the second empty record has a space and a tab, trust me bro):
$ cat file
1,2,3
1,2,3,4,5,6,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
Output of cat file after the execution of the GNU awk program:
1,2,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100

Extract string between two patterns (inclusive) while conserving the format

I have a file in the following format
cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACK
id3,PPLRTOMcccccccccccJACK
I am trying to identify and print the string between TOM and JACK including these two strings, while maintaining the first column FS=,
Desired output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
So far I have tried gsub:
awk -F"," 'gsub(/.*TOM|JACK.*/,"",$2) && !_[$0]++' test.txt > out.txt
and have the following output
id1 aaaaaaaaaaa
id2 bbbbbbbbbbb
id3 ccccccccccc
As you can see I am getting close but not able to include TOM and JACK patterns in my output. Plus I am also losing the original FS. What am I doing wrong?
Any help will be appreciated.
You are changing a field ($2) which causes awk to reconstruct the record using the value of OFS as the field separator and so in this case changing the commas to spaces.
Never use _ as a variable name - using a name with no meaning is just slightly better than using a name with the wrong meaning, just pick a name that means something which, in this case is seen but idk what you are trying to do when using that in this context.
gsub() and sub() do not support capture groups so you either need to use match()+substr():
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/){$2=substr($2,RSTART,RLENGTH)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or use GNU awk for the 3rd arg to match()
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or for gensub():
$ gawk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
The main difference between the match() and gensub() solutions is how they would behave if TOM appeared twice on the line:
$ cat file
id1,PPLLfooTOMbarTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACKfooJACKbar
id3,PPLRfooTOMbarTOMcccccccccccJACKfooJACKbar
$
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMbarTOMcccccccccccJACKfooJACK
$
$ awk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMcccccccccccJACKfooJACK
and just to show one way of stopping at the first instead of the last JACK on the line:
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=gensub(/(JACK).*/,"\\1","",a[0])} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMbarTOMcccccccccccJACK
Use capture groups to save the parts of the line you want to keep. Here's how to do it with sed
sed 's/^\([^,]*,\).*\(TOM.*JACK\).*/\1\2/' <test.txt > out.txt
Do you mean to do the following?
$ cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACKABCD
id2,PPLRTOMbbbbbbbbbbbJACKDFCC
id3,PPLRTOMcccccccccccJACKSDER
$ cat test.txt | sed -e 's/,.*TOM/,TOM/g' | sed -e 's/JACK.*/JACK/g'
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
$
This should work as long as the TOM and JACK do not repeat themselves.
sed 's/\(.*,\).*\(TOM.*JACK\).*/\1\2/' <oldfile >newfile
Output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK

sed to read one file and delete pattern in second file

I have a list of users that have logged in during the past 180 days and a list of total users from LDAP. Each list is within text file and each name is on its own line. With sed or awk can I have a script read from current-users.txt and delete from total-users.txt to give me a text document that has all of the inactive accounts for the past 180 days?
Thanks in advance!
No need to sed or awk, grep suffices:
grep -vf current-users.txt total-users
It returns all the lines that are in total_users but not in current-users.txt.
grep -f gets parameters from a file.
grep -v inverts the result.
Example
$ cat total_users
one
two
three
four
$ cat some_users
two
four
$ grep -vf some_users total_users
one
three
Using awk
awk 'NR==FNR{a[$1];next} !($1 in a)' current_users total_users

grep "output of cat command - every line" in a different file

Sorry title of this question is little confusing but I couldnt think of anything else.
I am trying to do something like this
cat fileA.txt | grep `awk '{print $1}'` fileB.txt
fileA contains 100 lines while fileB contains 100 million lines.
What I want is get id from fileA, grep that id in a different file-fileB and print that line.
e.g fileA.txt
1234
1233
e.g.fileB.txt
1234|asdf|2012-12-12
5555|asdd|2012-11-12
1233|fvdf|2012-12-11
Expected output is
1234|asdf|2012-12-12
1233|fvdf|2012-12-11
Getting rid of cat and awk altogether:
grep -f fileA.txt fileB.txt
awk alone can do that job well:
awk -F'|' 'NR==FNR{a[$0];next;}$1 in a' fileA fileB
see the test:
kent$ head a b
==> a <==
1234
1233
==> b <==
1234|asdf|2012-12-12
5555|asdd|2012-11-12
1233|fvdf|2012-12-11
kent$ awk -F'|' 'NR==FNR{a[$0];next;}$1 in a' a b
1234|asdf|2012-12-12
1233|fvdf|2012-12-11
EDIT
add explanation:
-F'|' #| as field separator (fileA)
'NR==FNR{a[$0];next;} #save lines in fileA in array a
$1 in a #if $1(the 1st field) in fileB in array a, print the current line from FileB
for further details I cannot explain here, sorry. for example how awk handle two files, what is NR and what is FNR.. I suggest that try this awk line in case the accepted answer didn't work for you. If you want to dig a little bit deeper, read some awk tutorials.
If the id's are on distinct lines you could use the -f option in grep as such:
cut -d "|" -f1 < fileB.txt | grep -F -f fileA.txt
The cut command will ensure that only the first field is searched for in the pattern searching using grep.
From the man page:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line.
The empty file contains zero patterns, and therefore matches nothing.
(-f is specified by POSIX.)

Resources