Compare 2 csv files and delete rows - Shell - bash
I have 2 CSV files. One has several columns, the other is just one column with domains. Simplified data from these files would be
file1.csv:
John,example.org,MyCompany,Australia
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
file2.csv:
example.org
google.es
mysite.uk
The output should be
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
I have tried this solution
grep -v -f file2.csv file1.csv >output-file
Found here
http://www.unix.com/shell-programming-and-scripting/177207-removing-duplicate-records-comparing-2-csv-files.html
But since there is no explanation whatsoever about how the script works, and I suck at shell, I cannot tweak it to make it work for me
A solution for this would be highly appreciated, a solution with some explanation would be awesome! :)
EDIT:
I have tried the line that was supposed to work, but for some reason it does not. Here is the output from my terminal. What's wrong with this?
Desktop $ cat file1.csv ; echo
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Desktop $ cat file2.csv ; echo
example.org
google.es
mysite.uk
Desktop $ grep -v -f file2.csv file1.csv
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Why doesn't grep remove the line
John,example.org,MyCompany,Australia
The line you posted works just fine.
$ grep -v -f file2.csv file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
And here's an explanation. grep will search for a given pattern in a given file and print all lines that match. The simplest example of usage is:
$ grep John file1.csv
John,example.org,MyCompany,Australia
Here we used a simple pattern that is matched literally, character by character, but you can also use regular expressions (basic, extended, and even Perl-compatible ones).
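For instance, you could match several names at once with an extended regular expression (a small sketch using the sample data from above; the pattern is anchored to the start of the line so it only matches the first column):
$ grep -E '^(John|Lenny),' file1.csv
John,example.org,MyCompany,Australia
Lenny,domain.com,OtherCompany,US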
To invert the logic, and print only the lines that do not match, we use the -v switch, like this:
$ grep -v John file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
To specify more than one pattern, you can use the option -e pattern multiple times, like this:
$ grep -v -e John -e Lenny file1.csv
Martha,site.com,ThirdCompany,US
However, if there is a larger number of patterns to check for, we might use the -f file option that will read all patterns from a file specified.
So when we combine all of those, reading patterns from a file with -f and inverting the matching logic with -v, we get the line you need.
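Two caveats are worth knowing, and either one may explain the EDIT above where the command appears to do nothing (this is a guess from the symptoms, not something the output shown proves). First, grep treats each pattern as a regular expression, so the dot in example.org matches any character; adding -F makes the patterns plain fixed strings:
$ grep -v -F -f file2.csv file1.csv
Second, if file2.csv was ever edited on Windows, every pattern may carry an invisible trailing carriage return that prevents any match. You can check for that with cat -v (each line would end in ^M) and strip it:
$ cat -v file2.csv
$ tr -d '\r' < file2.csv > file2-clean.csv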
One in awk:
$ awk -F, 'NR==FNR{a[$1];next}($2 in a==0)' file2 file1
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
Explained:
$ awk -F, ' # using awk, comma-separated records
NR==FNR { # process the first file, file2
a[$1] # hash the domain to a
next # proceed to next record
}
($2 in a==0) # process file1, if domain in $2 not in a, print the record
' file2 file1 # file order is important
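The condition ($2 in a==0) is just an unusual way of writing !($2 in a); both print the record when the second field is not a key in the array:
$ awk -F, 'NR==FNR{a[$1];next} !($2 in a)' file2 file1
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
Unlike the grep approach, awk compares the whole second field exactly, so a domain that merely appears somewhere else on a line would not cause that line to be dropped.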
Related
Extracting lines from 2 files using AWK just return the last match
I'm a bit new to using AWK and I'm trying to print the lines in one file where a specific field exists in another file. I copied exactly the examples that I found here, but I don't know why it's printing only the last match.
File1
58000
72518
94850
File2
58000;123;abc
69982;456;rty
94000;576;ryt
94850;234;wer
84850;576;cvb
72518;345;ert
Result expected
58000;123;abc
94850;234;wer
72518;345;ert
What I'm getting
94850;234;wer
awk -F';' 'NR==FNR{a[$1]++; next} $1 in a' file1 file2
What am I doing wrong?
awk (while usable here) isn't the correct tool for the job; grep with the -f option is. The -f file option will read the patterns from file, one per line, and search the input file for matches. So in your case you want:
$ grep -f file1 file2
58000;123;abc
94850;234;wer
72518;345;ert
(note: I removed the trailing '\' from the data file, replace it if it wasn't a typo)
Using awk
If you did want to rewrite what grep is doing using awk, that is fairly simple. Just read the contents of file1 into an array, and then when processing records from the second file, check whether field 1 is in the array; if so, print the record (the default action), e.g.
$ awk -F';' 'FNR==NR {a[$1]=1; next} $1 in a' file1 file2
58000;123;abc
94850;234;wer
72518;345;ert
(same note about the trailing slash)
Thanks @RavinderSingh13! The file1 really did have some hidden characters, and I could see them using cat:
$ cat -v file1
58000^M
72518^M
94850^M
I removed them using sed -e "s/\r//g" file1 and the AWK worked perfectly.
AWK remove blank lines and append empty columns to all csv files in the directory
Hi, I am looking for a way to combine all the below commands together:
1. Remove blank lines in the csv file (comma delimited)
2. Add multiple empty columns to each line, up to the 100th column
3. Perform 1 & 2 on all the files in the folder
I am still learning, and this is the best I could get:
awk '!/^[[:space:]]*$/' x.csv > tmp && mv tmp x.csv
awk -F"," '($100="")1' OFS="," x.csv > tmp && mv tmp x.csv
They work individually, but I don't know how to put them together, and I am looking for a way to have it run through all the files under the directory. Looking for concrete AWK code or a shell script calling AWK. Thank you!
An example input would be:
a,b,c
x,y,z
Expected output would be:
a,b,c,,,,,,,,,,
x,y,z,,,,,,,,,,
You can combine them in one script, without any loops:
$ awk 'BEGIN{FS=OFS=","} FNR==1{close(f); f=FILENAME".updated"} NF{$100=""; print > f}' files...
It won't overwrite the original files.
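A quick usage sketch (assuming your files match *.csv; the .updated suffix comes from the script itself): run it over the directory, inspect the results, and only then move them over the originals:
$ awk 'BEGIN{FS=OFS=","} FNR==1{close(f); f=FILENAME".updated"} NF{$100=""; print > f}' *.csv
$ for f in *.csv.updated; do mv "$f" "${f%.updated}"; done   # only after verifying the output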
You can pipe the output of the first to the other:
awk '!/^[[:space:]]*$/' x.csv | awk -F"," '($100="")1' OFS="," > new_x.csv
If you wanted to run the above on all the files in your directory, you would do:
shopt -s nullglob
for f in yourdirectory/*.csv; do
    awk '!/^[[:space:]]*$/' "${f}" | awk -F"," '($100="")1' OFS="," > "${f%/*}/new_${f##*/}"
done
The output name is built as "${f%/*}/new_${f##*/}" (the directory part, then new_ plus the file name), since prefixing the whole path as new_"${f}" would point into a non-existent new_yourdirectory. The shopt -s nullglob is so that an empty directory won't give you a literal *.
With a recent enough GNU awk you could:
$ gawk -i inplace 'BEGIN{FS=OFS=","}/\S/{NF=100;$1=$1;print}' *
Explained:
$ gawk -i inplace ' # using GNU awk and in-place file editing
BEGIN {
    FS=OFS=","   # set both delimiters to a comma
}
/\S/ {           # gawk-specific regex escape matching any character that is not a space
    NF=100       # set the field count to 100, which truncates fields above it
    $1=$1        # touch the first field to rebuild the record and actually get the extra commas
    print        # output the record
}' *
Some test data (the first empty record is truly empty; the second contains a space and a tab, which you cannot see here):
$ cat file
1,2,3

1,2,3,4,5,6,

1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
Output of cat file after the execution of the GNU awk program:
1,2,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
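Since -i inplace rewrites the files directly, you may want backups first. The inplace extension can keep one for you, though the variable name depends on the gawk release (a sketch for gawk 5.x; check your gawk's manual):
$ gawk -i inplace -v inplace::suffix=.bak 'BEGIN{FS=OFS=","}/\S/{NF=100;$1=$1;print}' *
Older gawk 4.x spells it INPLACE_SUFFIX instead of inplace::suffix.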
Extract string between two patterns (inclusive) while conserving the format
I have a file in the following format:
$ cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACK
id3,PPLRTOMcccccccccccJACK
I am trying to identify and print the string between TOM and JACK, including these two strings, while keeping the first column and the FS=,
Desired output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
So far I have tried gsub:
awk -F"," 'gsub(/.*TOM|JACK.*/,"",$2) && !_[$0]++' test.txt > out.txt
and have the following output:
id1 aaaaaaaaaaa
id2 bbbbbbbbbbb
id3 ccccccccccc
As you can see, I am getting close, but I am not able to include the TOM and JACK patterns in my output, plus I am also losing the original FS. What am I doing wrong? Any help will be appreciated.
You are changing a field ($2), which causes awk to reconstruct the record using the value of OFS as the field separator, in this case changing the commas to spaces.
Never use _ as a variable name: a name with no meaning is only slightly better than a name with the wrong meaning, so pick a name that says something. Here that would be seen, although it's not clear what the deduplication in !_[$0]++ is meant to achieve in this context.
gsub() and sub() do not support capture groups, so you either need to use match()+substr():
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/){$2=substr($2,RSTART,RLENGTH)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or use GNU awk for the 3rd arg to match():
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or GNU awk for gensub():
$ gawk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
The main difference between the match() and gensub() solutions is how they would behave if TOM appeared twice on the line:
$ cat file
id1,PPLLfooTOMbarTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACKfooJACKbar
id3,PPLRfooTOMbarTOMcccccccccccJACKfooJACKbar
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMbarTOMcccccccccccJACKfooJACK
$ awk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMcccccccccccJACKfooJACK
and, just to show one way of stopping at the first instead of the last JACK on the line:
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=gensub(/(JACK).*/,"\\1","",a[0])} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMbarTOMcccccccccccJACK
Use capture groups to save the parts of the line you want to keep. Here's how to do it with sed:
sed 's/^\([^,]*,\).*\(TOM.*JACK\).*/\1\2/' <test.txt >out.txt
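A quick breakdown of that expression, piece by piece (the same pattern, just split apart):
^\([^,]*,\)     # group 1: everything up to and including the first comma (the id column)
.*              # greedily skip ahead
\(TOM.*JACK\)   # group 2: from TOM through the last JACK
.*              # whatever is left after JACK
The replacement \1\2 then glues the two saved groups back together, which is how both the first column and its separator survive.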
Do you mean to do the following?
$ cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACKABCD
id2,PPLRTOMbbbbbbbbbbbJACKDFCC
id3,PPLRTOMcccccccccccJACKSDER
$ cat test.txt | sed -e 's/,.*TOM/,TOM/g' | sed -e 's/JACK.*/JACK/g'
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
This should work as long as TOM and JACK do not repeat themselves.
sed 's/\(.*,\).*\(TOM.*JACK\).*/\1\2/' <oldfile >newfile
Output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
sed to read one file and delete pattern in second file
I have a list of users that have logged in during the past 180 days, and a list of total users from LDAP. Each list is in a text file, and each name is on its own line. With sed or awk, can I have a script read from current-users.txt and delete those entries from total-users.txt, to give me a text document containing all of the accounts that have been inactive for the past 180 days? Thanks in advance!
No need for sed or awk; grep suffices:
grep -vf current-users.txt total-users.txt
It returns all the lines that are in total-users.txt but not in current-users.txt. grep -f gets its patterns from a file; grep -v inverts the result.
Example:
$ cat total_users
one
two
three
four
$ cat some_users
two
four
$ grep -vf some_users total_users
one
three
Using awk:
awk 'NR==FNR{a[$1];next} !($1 in a)' current_users total_users
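This is the same two-file idiom explained earlier, line by line (file names as in the answer; $1 works here if names contain no spaces, otherwise use $0 and !($0 in a) so the whole line is compared):
NR==FNR { a[$1]; next }   # first file only: store each current user as an array key
!($1 in a)                # second file: print the line if that user was never stored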
grep "output of cat command - every line" in a different file
Sorry, the title of this question is a little confusing, but I couldn't think of anything else. I am trying to do something like this:
cat fileA.txt | grep `awk '{print $1}'` fileB.txt
fileA contains 100 lines while fileB contains 100 million lines. What I want is: get an id from fileA, grep for that id in a different file, fileB, and print that line. e.g.
fileA.txt
1234
1233
fileB.txt
1234|asdf|2012-12-12
5555|asdd|2012-11-12
1233|fvdf|2012-12-11
Expected output is
1234|asdf|2012-12-12
1233|fvdf|2012-12-11
Getting rid of cat and awk altogether:
grep -f fileA.txt fileB.txt
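With 100 million lines in fileB, two refinements may help (a sketch, not benchmarked here): -F makes grep treat the ids as fixed strings rather than regexes, which is usually much faster, and anchoring each pattern keeps an id like 1234 from matching inside another column:
grep -F -f fileA.txt fileB.txt                   # fixed strings, still matches anywhere on the line
grep -f <(sed 's/.*/^&|/' fileA.txt) fileB.txt   # each id anchored to the start of the first |-delimited field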
awk alone can do that job well:
awk -F'|' 'NR==FNR{a[$0];next;}$1 in a' fileA fileB
See the test:
kent$ head a b
==> a <==
1234
1233
==> b <==
1234|asdf|2012-12-12
5555|asdd|2012-11-12
1233|fvdf|2012-12-11
kent$ awk -F'|' 'NR==FNR{a[$0];next;}$1 in a' a b
1234|asdf|2012-12-12
1233|fvdf|2012-12-11
EDIT: add explanation:
-F'|'                  # use | as the field separator (it matters for fileB)
NR==FNR{a[$0];next;}   # while reading fileA, save each line as a key in array a
$1 in a                # while reading fileB, print the line if its first field is a key in a
I cannot explain all the details here, for example how awk handles two files, or what NR and FNR are. I suggest you try this awk line in case the accepted answer didn't work for you, and if you want to dig a little bit deeper, read some awk tutorials.
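A tiny demonstration of that NR/FNR behaviour, using the same a and b files from the test above: NR keeps counting across all input files, while FNR restarts at 1 for each file, so NR==FNR is only true while the first file is being read:
$ awk '{print FILENAME, NR, FNR}' a b
a 1 1
a 2 2
b 3 1
b 4 2
b 5 3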
If the ids are on distinct lines, you could use the -f option in grep as such:
cut -d "|" -f1 < fileB.txt | grep -F -f fileA.txt
The cut command ensures that only the first field is searched during the pattern matching done by grep. From the man page:
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)
Note that because the other fields are cut away, this prints only the matching ids, not the full lines from fileB.txt.