BASH - How to check for duplicate email addresses across multiple files?

I'm currently working on a project where I need to send an email to a large number of email addresses. As such, I am trying to avoid any "temporary" glitches such as service providers throttling emails, etc.
My plan is to take the initial list of email addresses and chop it up into smaller (chopped) lists, so that they can be scheduled in a staggered manner. Due to the sensitive nature of sending emails, I want to ensure that no duplicate email addresses exist across any of the chopped lists. Is there a way to do this via bash?
Side note: I am 100% certain that all email addresses in the master list are unique, due to the nature of the query used to compile the list. I would just like to ensure that my script, which chops the master list, does not have a defect that creates duplicate email addresses across the chopped lists.

You can put the chopped files together (temporarily) via cat and use sort --unique to remove duplicates - then check if the result has as many lines as the original file:
cat original_list | wc -l
and
cat list_part* | sort --unique | wc -l
If the results are the same, there are no duplicates.
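A minimal sketch that folds both counts into one check (assuming the chopped files match list_part*); as a bonus it also catches addresses that were dropped entirely, since the counts would then differ as well:
if [ "$(wc -l < original_list)" -eq "$(cat list_part* | sort --unique | wc -l)" ]; then
    echo "no duplicates across the chopped lists"
else
    echo "duplicate (or missing) addresses detected"
fi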

Try
cat *.txt | sort | sort -u -c
assuming that your filenames end with .txt.
The first sort command orders all email addresses. The second sort command checks that no two consecutive lines are equal and reports an error otherwise.
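Because sort -u -c exits non-zero when it finds a repeated line (it reports the first offender on stderr), the check can drive a conditional directly; a small sketch, again assuming the *.txt naming:
if cat *.txt | sort | sort -u -c; then
    echo "all addresses are unique across the chopped lists"
else
    echo "duplicate address found"
fi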

The Problem
You need to sort unique addresses, and then split the ordered list into chunks.
The Solution
Given the following assumptions:
Your emails are stored in files called emails_xxxx.txt. (Note: You can name them anything you like, but a sensible set of filenames that are easy to glob will make your life simpler.)
Each line holds one address.
you can handle this with a short pipeline. sort accepts multiple file arguments (e.g. from a shell glob or xargs), so you can avoid the "useless use of cat." You then pipe the output into split, where you can control various aspects of the chunking. For example:
sort --unique emails_*.txt |
  split --numeric-suffixes \
        --lines=200 \
        --suffix-length=4 \
        --verbose
This splits the sorted/filtered lines into chunks of up to 200 lines each, and names each chunk with a numeric extension suitable for batch processing. You can adjust the lines and suffix length to suit your requirements.
Sample Output
creating file `x0000'
creating file `x0001'
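If you would rather have descriptive chunk names, split also accepts an output prefix after the input operand; a variant of the same pipeline (the batch_ prefix is just an example, and - stands for stdin):
sort --unique emails_*.txt |
  split --numeric-suffixes \
        --lines=200 \
        --suffix-length=4 \
        --verbose \
        - batch_
This would create batch_0000, batch_0001, and so on.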

Related

Terminal: SORT command; how to sort correctly?

I have written a shell script that gets all the file names from a folder, and all its sub-folders, and copies them to the clipboard after sorting (removing all paths; I just need a simple file list of the thousands of randomly named files within).
What I can’t figure out is how to get the SORT command to sort properly. Meaning, the way a spreadsheet would sort things. Or the way your Mac finder sorts things.
Underscores > numbers > letters (regardless of case)
Anyone know how to do this? Sort -n only works for files starting with numbers, sort -f was close but separated the lower case and capitals in a weird way, and anything starting with a number was all over the place. Sort -V was the closest, but anything starting with an underscore went to the bottom instead of the top… I’m about to lose my mind. 🤣
I’ve been trying to figure this out for a week, and no combination of anything I have tried gets the sort command to actually, ya know, sort properly.
Help?
If I understand the problem correctly, you want the "natural sort order" as described in Natural sort order - Wikipedia, Sorting for Humans : Natural Sort Order, and macos - How does finder sort folders when they contain digits and characters?.
Using Linux sort(1) you need the -V (--version-sort) option for "natural" sort. You also need the -f (--ignore-case) option to disregard the case of letters. So, assuming that the file names are stored one-per-line in a file called files.txt you can produce a list (mostly) sorted in the way that you want with:
sort -Vf files.txt
However, sort -Vf sorts underscores after digits and letters on my system. I've tried using different locales (see How to set locale in the current terminal's session?), but with no success. I can't see a way to change this with sort options (but I may be missing something).
The characters . and ~ seem to consistently sort before numbers and letters with sort -V. A possible hack to work around the problem is to swap underscore with one of them, sort, and then swap again. For example:
tr '_~' '~_' <files.txt | LC_ALL=C sort -Vf | tr '_~' '~_'
seems to do what you want on my system. I've explicitly set the locale for the sort command with LC_ALL=C ... so it should behave the same on other systems. (See Why doesn't sort sort the same on every machine?.)
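As a quick sanity check, the swap trick can be tried on a small throwaway list (the file names below are made up):
printf '%s\n' _drafts 10_final 2_final notes > files.txt
tr '_~' '~_' <files.txt | LC_ALL=C sort -Vf | tr '_~' '~_'
On GNU sort this should print _drafts, 2_final, 10_final, notes: underscores first, then numbers in natural order, then letters.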
It appears you want to sort in dictionary order and fold case, so it would be sort -df.

Merging and appending two big delimited files

I have two huge comma-delimited files.
The 1st file has 280 million lines and the following columns
first name, last name, city, state, ID, email*, phone
John,Smith,LA,CA,123123123123,johnsmith#yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen#gmail.com,12041284
The 2nd file has 420 million lines and the following columns
first name, last name, city, state, email
John,Smith,LA,CA,johnsmith#hotmail.com
Bob,Marble,SF,CA,bobmarble#gmail.com
Tai,Nguyen,SD,CA,tainguyen#gmail.com
* a lot of these fields are empty
I want to merge all the lines from both files whose first 4 columns match, and then fill in the missing emails in the first file with the emails from the second file; if the email is not blank, don't change it. The process should be case-insensitive. If there are multiple instances with the same 4 fields, just ignore those instances and do the work on unique instances only.
The result should have the following columns and look like this
first name, last name, city, state, ID, email, phone
John,Smith,LA,CA,123123123123,johnsmith#yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,bobmarble#gmail.com,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen#gmail.com,12041284
They should only print out entries that have all 4 columns matched, not 1 or 2 or 3. My boss insists on using a Bash shell script for this and I am a newbie in Bash. Please help me with a clear explanation, as I am such a newbie.
I have done my reading and understand that awk requires storing information in memory. However, I can split the big files into small files and use awk in that case. I copied some code online and changed it to my needs, but whenever it fills in a blank email, it also changes the field delimiter from comma to space. I want to stop that but don't know how. Please help me solve this problem. All advice and answers are highly appreciated.
awk -F "," 'NR==FNR{a[$1,$2,$3,$4]=$5;next}{if ($6 =="") $6=a[$1,$2,$3,$4];print}' file2.txt file1.txt > file3.txt
The awk approach you showed is not suited to files that big: it stores parts of the files in memory. With the same approach you would need to store either
280 million entries of the form first name, last name, city, state → ID, phone
420 million entries of the form first name, last name, city, state → email
Assume we go with the first option and each entry takes up only 50 bytes of memory. To store all 280 million entries we need 280M·50B = 14'000 MB = 14 GB. This is the absolute minimum of memory you need to run the awk command. In reality it would be even more due to implementation details of associative arrays.
What you can do instead
Use the classical approach to the problem:
sort both files
join the files by their first four columns*
cut the desired columns from the joined result**
* Needs some pre- and post-processing, as join can only join on a single column.
** Since we have to re-arrange the email column, cut is not sufficient; we can use awk instead.
#! /bin/bash

# Prefix each line with a lowercased copy of its first four fields (the join key),
# separated from the original line by a tab.
prefixWithKey() {
    sed -E 's/([^,]*,){4}/\L&\E\t&/' "$1"
}

# Sort a file by its tab-delimited key column, overwriting the file.
sortByKeyInPlace() {
    sort -t $'\t' -k1,1 -o "$1" "$1"
}

# Join two key-prefixed files on their first (tab-delimited) column.
joinByKey() {
    join -t $'\t' "$1" "$2"
}

# Drop the key and re-assemble the output columns:
# name, city, state, ID and phone from file 1, email from file 2.
cutColumns() {
    awk 'BEGIN{FS="\t|,\t*"; OFS=","} {print $5,$6,$7,$8,$9,$16,$11}'
}

file1="your 1st input file.csv"
file2="your 2nd input file.csv"

for i in "$file1" "$file2"; do
    prefixWithKey "$i" > "$i.tmp"
    sortByKeyInPlace "$i.tmp"
done

joinByKey "$file1.tmp" "$file2.tmp" | cutColumns > result.csv
rm "$file1.tmp" "$file2.tmp"
This script assumes that the input files have no headers and contain no tabs. We always take the email field from the 2nd file, no matter whether the email field of the 1st file was defined or not.
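If you do want to keep the 1st file's email whenever it is non-empty, as described in the question, cutColumns could be adjusted along these lines (a sketch; in the joined record $10 is the email from the 1st file and $16 the email from the 2nd file):
cutColumns() {
    awk 'BEGIN{FS="\t|,\t*"; OFS=","} {email = ($10 != "") ? $10 : $16; print $5,$6,$7,$8,$9,email,$11}'
}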
I barely tested this script because you didn't provide any example input. If you encounter some errors and share a short input leading to that error I would be happy to fix the script (if it needs fixing).
In theory the script could be written without temporary files. I intentionally used temporary files because of the input size. Programs like sort may run faster on files.
This script could be sped up, for instance by:
Executing both calls to prefixWithKey in parallel.
Adding LC_ALL=C in front of commands like sort (a sketch follows this list).
Adding options to sort, for instance -S 70%.
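For example, the second and third suggestions could be folded into the helper functions like this (a sketch; if sort runs under LC_ALL=C, the join step should use it too, so that both tools agree on the ordering of the keys):
sortByKeyInPlace() {
    LC_ALL=C sort -S 70% -t $'\t' -k1,1 -o "$1" "$1"
}

joinByKey() {
    LC_ALL=C join -t $'\t' "$1" "$2"
}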
Further Alternatives
For files that big it could be faster to load them into a database and process them there. There is even the tool q for doing things like this in a single command, but from what I have experienced it's very slow.

Bash: Sort file numerically, but only where the first field matches a pattern

Due to poor past naming practices, I'm left with a list of names that is proving to be a challenge to work with. The bottom line is that I want the most current name (by date) to be placed in a variable. All the names are listed (unsorted) in a file called bar.txt.
In this case I can't rename, and there's no way to get the actual dates of the images; these names are all I have to go on. The names can follow one of several patterns;
foo
YYYYMMDD-foo
YYYYMMDD##-foo
foo can be anything from a single character to a long string of letters/numbers/symbols. I am interested only in the names matching the second use case, YYYYMMDD-foo, as those are from after we started tagging consistently.
I would like to end up with a variable containing the most recent date that follows the pattern YYYYMMDD-foo.
I know about sort -k1 -n < bar.txt, but I'm not sure how to isolate the second pattern's results to extract what I need.
How do I sort the file to ignore anything but the second pattern, and return the most current date?
Sample
Given that bar.txt looks like this;
test
2017120901-develop-BUILD-31
20170326-TEST-1.2.0
20170406-BUILD-40-1.2.0-test
2010818_001
I would want to extract 20170406-BUILD-40-1.2.0-test
Since your requirement involves (1) getting only the names of a certain format and (2) sorting them and keeping only the latest one, I am using Awk and GNU sort together to achieve it:
awk -F'-' 'length($1) == 8' file | sort -nrk1 | head -1
20170406-BUILD-40-1.2.0-test
The solution works by keeping only those lines whose first '-'-separated column is exactly 8 characters long, corresponding to the YYYYMMDD layout. Once those are filtered, a reverse numeric sort is applied on the first field and the first line is obtained using head.
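To land the result in a variable, as asked, the same pipeline can be wrapped in a command substitution (a sketch; the extra regex test is an added guard so that an 8-character first field really is all digits):
latest="$(awk -F'-' 'length($1) == 8 && $1 ~ /^[0-9]+$/' bar.txt | sort -nrk1 | head -1)"
echo "$latest"    # prints 20170406-BUILD-40-1.2.0-test for the sample above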

sed delete unmatched lines between two lines with bash variable

I need help understanding a weird problem with sed, bash and a while loop.
My data looks like this:
-File 1- CSV
account,hostnames,status,ipaddress,port,user,pass
-File 2- XML - This is a sample record set for two entries under one account
<accountname="account">
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
<cname="fqdn or simple name goes here">
<field="hostname">ahostname or ipv4 goes here</field>
<protocol>aprotocol</protocol>
<field="port">aportnumber</field>
<field="username">ausername</field>
<field="password">apassword</field>
</cname>
</accountname>
So far, I can add records in between the respective account holder from File1 to File2. But if I need to remove records that no longer exist, it does not work reliably, since it wipes out records from other accounts; i.e., it does not delete only between a matched accountname.
I import from File 1 into File 2 with a while loop in my bash program:
-Bash Program excerpts-
//Read File in to F//
cat File 2 | while read F
do
//extract fields from F into variables
_vmname="$(echo $F |grep 'cname'| sed 's/<cname="//g' |sed 's/.\{2\}$//g')"
_account="$(echo $F | grep 'accountname' | sed 's/accountname="//g' |sed 's/.\{2\}$//g')"
// I then compare my File1 and look for stale records that are still in File2
if grep "$_vmname" File1 ;then
continue
else
// if not matched, delete between the respective accountname
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
If I manually declare _vmname and _account and run
sed -i '/'"$_account"'/,/<\/accountname>/ {/'"$_vmname"'/,/<\/cname>/d}' File2
It removes the stale records from File2. When I let my bash script run, it does not.
I think I have three problems:
Reading the variables for _vmname and _account inside a loop makes it read numerous times. Any better way to do this is appreciated.
I do not think the sed statement for matching these two patterns and then deleting works like I want inside a while loop.
I may have a logic problem with my thought chain.
Any pointers, and please no awk, perl, lxml or python for this one.
Thanks!
and please no awk
I appreciate that you want to keep things simple, and I suppose awk seems more complicated than what you're doing. But I'd like to point out you have so far 3 grep and 4 sed invocations per line in the file, to process another file N times, once per line. That's O(mn) using the slowest method on the planet to read the file (a while loop). And it doesn't work.
I may have a logic problem with my thought chain.
I'm afraid we must allow for that possibility!
The right advice is to tackle XML with an XML parser, because XML is not a regular language and so can't be parsed with regular expressions. And that's really what you need here, because your program processes the whole XML document. You're not just plucking out bits and depending on incidental formatting artifacts; you want to add records that aren't there and remove those that "no longer exist". Apparently there is information in the XML document you need to preserve, else you would just produce it from the CSV. A parser would spoon-feed it to you.
The second-best advice is to use awk. I suppose you might try an approach like:
Process the CSV and produce the XML to be inserted.
In awk, first read the new input XML into an array keyed by cname. Then process the XML target once. For every cname, consult your array; if you find a match, insert your pre-constructed XML replacement (or modify the "paragraph" accordingly).
I'm not sure what the delete criteria are, so I don't know if it can be done in the same pass as step #2. If not, extract the salient information somehow. Maybe print a list of keys from each of the two files, and use comm(1) to produce a to-be-deleted list (a rough sketch follows below). Then, similar to step #2, read in that list and process the XML file one more time. Write anything you delete to stderr so you can keep track of what went missing, and from which lines.
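As a rough sketch of that comm(1) idea, assuming the ad hoc <cname="..."> layout shown in the question and that the cname corresponds to the hostnames column of the CSV:
# keys present in the XML
grep -o '<cname="[^"]*"' File2 | sed 's/<cname="//; s/"$//' | sort -u > xml_keys.txt
# keys present in the CSV (hostnames is column 2; drop the header line first if there is one)
cut -d, -f2 File1 | sort -u > csv_keys.txt
# cnames that exist only in the XML, i.e. candidates for deletion
comm -23 xml_keys.txt csv_keys.txt > to_delete.txt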
Any pointers
Whenever you find yourself processing the same file N times for N inputs, you know you're headed for trouble. One of the two inputs is always smaller, and that one can be put in some kind of array. cat file | while read is another warning signal, telling you to use awk or any of a dozen obvious utilities that understand lines of text.
You posted your question on SO two weeks ago. I suspect no one answered it because you warned them away: preemptively saying, in effect, don't tell me to use good tools. I'm only here to suggest that you'll be more comfortable after you take off that straightjacket. Better tools, in this case, are the only right answer.

Sort ignores an apostrophe - sometimes (except when it is the only column used); WHY?

This happens to me both on Linux and on cygwin, so I suspect it is not a bug. Still, I don't understand it. Can anyone explain?
Consider the following file (tab-delimited, and that's a regular apostrophe)
(I created it with cat to ensure that it wasn't non-printing characters that were the source of the problem)
$cat > temp
cat 1389
cat' 1747
ca't 3175
cat 46848484
ca't 720
$sort temp
<gives the exact same output as cat temp>
$sort -k1,1 temp
cat 1389
cat 46848484
cat' 1747
ca't 3175
ca't 720
Why do I have to ignore the second column in order to sort correctly?
I pulled up the manual for sort and noticed the following:
* WARNING * The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.
As it turns out, locales actually specify how lexicographic ordering works for a given locale. This makes a lot of sense, but for some reason it trips over multi-field files...
(see also:)
Unusual behaviour of linux's sort command
Why does the sort command sort differently if there are trailing fields?
There are a couple of things you can do:
You can sort naively by byte value using
LC_ALL="C" sort temp
This will give a more logical result, but it might not be the one you actually want.
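On the sample file above, that byte-value ordering should come out roughly as:
ca't 3175
ca't 720
cat 1389
cat 46848484
cat' 1747
The tab (0x09) sorts before the apostrophe (0x27), which in turn sorts before the letter t, so the grouping is decided purely by byte values rather than by any collation rule.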
You could try to get sort to do a more basic lexicographical ordering by setting the locale to C and telling it you want dictionary ordering:
LC_ALL="C" sort -d temp
To have sort output your locale information and highlight the sort key, you can use
sort --debug temp
Personally I'm really curious to know what rule is being specified that makes sort behave unintuitively across multiple fields.
They're supposed to specify correct lexicographic order in the given language and dialect. Do the locales' functions simply not handle the multiple field case at all, or are they taking some kind of different interpretation on the "meaning" of the line?
