Removing duplicate e-mail addresses from mailing list

emma matthews

12:25 pm on Nov 14, 2002 (gmt 0)

We operate a b2b mailing list, which over the years has grown to many thousands of members.

I want to clean this list out and remove duplicate company e-mail addresses. I'm not just talking about exact duplicates, i.e. x@y.com and x@y.com, as my software handles those already. What I mean is that I want to keep only one address per server (domain), e.g. if x@company1.com has subscribed, everything else at that domain, such as y@company1.com, is deleted. Furthermore, if a company that has already subscribed tries to subscribe with another e-mail address, this should not be permitted.

This seems simple enough, but I've been pulling my hair out searching Google and various CGI sites for ages without any results. Any ideas?

aspdaddy

12:40 pm on Nov 14, 2002 (gmt 0)

What software are you using?

<added>

Something like this should do it (JET SQL):
SELECT DISTINCT Mid([EmailAddress], InStr([EmailAddress], "@") + 1) AS [Domain], Left([EmailAddress], InStr([EmailAddress], "@") - 1) AS [User] INTO Temp
FROM [Mail List]

Then:

SELECT First([User]) & '@' & [Domain] AS EmailAddress FROM Temp
GROUP BY [Domain]

Though there's probably a much easier way :)
</added>

jatar_k

5:17 pm on Nov 14, 2002 (gmt 0)

Welcome to WebmasterWorld [webmasterworld.com] emma_matthews

What language are you looking for?

emma matthews

12:58 pm on Nov 15, 2002 (gmt 0)

I was looking for a cgi script to do this. I am fairly new to this, and at the moment use this script to send my emails and manage the list:

[cgi-factory.com...]

However, what I ideally need is a CGI script that I could run to perform the task, i.e. I input my e-mail list (initially, and then at regular intervals) and it gives me a clean one without the duplicate servers.

If this doesn't make sense, apologies, as I said I am very new to cgi/perl programming.

bird

1:26 pm on Nov 15, 2002 (gmt 0)

Do you have access to a Linux or other Unix system with the GNU tools installed? The following command would do the trick there, without any need to shoehorn the whole thing through CGI:

sort -i -t @ -k 2 -u old_email_list.txt > new_email_list.txt

This assumes that your addresses are stored one per line in a plain text file. It will ignore non-printing characters (-i), split each line (internally) at the separator "@", sort on the second key (everything after the "@") and return a unique (-u) list, in which each of the sorted keys occurs only once.
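A small sketch of that command on made-up data, assuming GNU sort is available. The addresses and the sort_demo directory are hypothetical, and LC_ALL=C is an addition here, only to make the ordering predictable across systems.

```shell
#!/bin/sh
# Hypothetical sample data: one address per line, in a scratch directory.
demo=sort_demo
mkdir -p "$demo"
cat > "$demo/old_email_list.txt" <<'EOF'
x@company1.com
a@company2.com
y@company1.com
b@company3.com
z@company2.com
EOF

# Same options as the command above; LC_ALL=C is an assumption added
# for predictable byte-wise ordering.
LC_ALL=C sort -i -t @ -k 2 -u "$demo/old_email_list.txt" \
    > "$demo/new_email_list.txt"

# One line per domain is left, ordered by domain.
cat "$demo/new_email_list.txt"
```

The result has three lines, one each for company1.com, company2.com and company3.com. The sketch makes no attempt to choose which address of a domain survives.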

The only potential problem with this approach is that you'll have no control over which of several addresses from each domain gets through and which are discarded. But this problem will be common to all approaches other than manual elimination.

Preventing someone from subscribing with an already registered domain would need to be integrated into your existing subscription mechanism, and will therefore be more complex. I don't expect any existing software to support this, as it is quite an unusual requirement (do you really want to allow only one hotmail user, for example?).

Dante_Maure

12:38 am on Nov 16, 2002 (gmt 0)

I know that this was really a technical query, so this may be a bit off topic, but I have to ask...

Why do you want to prevent more than one person at a given company from receiving your mailings?

emma matthews

12:54 pm on Nov 20, 2002 (gmt 0)

Dante_Maure: This is because I mail out eVouchers to my subscribers. Although it's meant to be only 1 per company, they often abuse the system when they receive multiple vouchers.

bird: Cheers! This worked a treat. Now that I have two files, old_email_list.txt and new_email_list.txt, is there any way I can get a list of the differences between them (i.e. the addresses discarded, for my records) using GNU tools? Then I can check manually for things like hotmail, aol etc. addresses that have been accidentally deleted.

As regards preventing this happening in the future, surely a simple CGI script could do the trick. I could have a form on my website (unrelated to the existing mailing list software), and if someone typed in their e-mail address, it would check the domain against the existing list of subscribers and, if the domain is unique, add the address to file 1, but if not, to file 2. The unique ones can then be added to my mailing list software later, without having to interfere with the existing software. Is there no CGI script that will do this (mailing list functionality not needed)?

Thanks.
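The check described above could be sketched roughly as follows in plain shell. This is only an illustration under stated assumptions: the file names, the signup_demo directory and the helper functions domain_of, domain_known and register are all hypothetical, and a real CGI script would also need input validation and file locking, which are left out here.

```shell
#!/bin/sh
# Hypothetical file names; none of them come from the thread.
demo=signup_demo
mkdir -p "$demo"
subscribers="$demo/subscribers.txt"   # existing list, one address per line
accepted="$demo/accepted.txt"         # "file 1": first address seen for a domain
rejected="$demo/rejected.txt"         # "file 2": domain already subscribed
printf '%s\n' 'x@company1.com' 'a@company2.com' > "$subscribers"
: > "$accepted"
: > "$rejected"

# Print the lower-cased part of an address after the last "@".
domain_of() {
    printf '%s\n' "${1##*@}" | tr '[:upper:]' '[:lower:]'
}

# Succeed if any address in the given files belongs to the given domain.
domain_known() {
    wanted=$1
    shift
    awk -F@ -v d="$wanted" \
        'tolower($NF) == d { found = 1; exit } END { exit !found }' "$@"
}

# Append a new address to the accepted or the rejected file.
register() {
    addr=$1
    if domain_known "$(domain_of "$addr")" "$subscribers" "$accepted"; then
        printf '%s\n' "$addr" >> "$rejected"
    else
        printf '%s\n' "$addr" >> "$accepted"
    fi
}

register 'y@Company1.com'     # company1.com is already subscribed
register 'new@company9.com'   # first address from company9.com
register 'two@company9.com'   # company9.com was accepted a moment ago
```

After these three calls the accepted file holds new@company9.com and the rejected file holds the other two addresses. Checking the accepted file as well as the subscriber list is a design choice, so that two sign-ups from the same new domain are not both let through before the list is updated.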

emma matthews

2:06 pm on Jan 7, 2003 (gmt 0)

A couple more questions:

1. Is there any way of stopping it re-sorting the list into alphabetical order by server when using 'sort'? I would prefer the list to be left in its existing order, with only the repeated instances of any given server deleted.

2. Is there any way of comparing two lists and outputting the difference using GNU tools (see the question above)?

Thanks.

bird

7:46 pm on Jan 7, 2003 (gmt 0)

I just wanted to recommend the uniq command, but that only operates on sorted files... What's the problem with sorting? I'm sure you could convince sed or awk to do what you want, but I can't think of a simple way right now.
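For what it's worth, one possible awk approach keeps the original order by remembering which domains have already been seen. A minimal sketch, assuming one address per line; the sample data and the order_demo directory are made up, and the lower-casing of the domain is an added assumption rather than something the sort command did.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
demo=order_demo
mkdir -p "$demo"
cat > "$demo/old_email_list.txt" <<'EOF'
x@company1.com
a@company2.com
y@Company1.com
b@company3.com
z@company2.com
EOF

# Keep a line only the first time its domain (the text after the last "@",
# lower-cased) is seen; the input order is left untouched.
awk -F@ '!seen[tolower($NF)]++' "$demo/old_email_list.txt" \
    > "$demo/new_email_list.txt"

cat "$demo/new_email_list.txt"
```

With this input the output is x@company1.com, a@company2.com and b@company3.com, in that order, so the first address listed for each domain is the one that survives.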

diff is used to compare files. But that, too, is only useful with sorted files.
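One way to list the discarded addresses without sorting either file might be grep with a pattern file. A sketch on made-up data, assuming every line of the new list also appears verbatim in the old one; the diff_demo directory and discarded.txt are hypothetical names.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory.
demo=diff_demo
mkdir -p "$demo"
cat > "$demo/old_email_list.txt" <<'EOF'
x@company1.com
a@company2.com
y@company1.com
b@company3.com
z@company2.com
EOF
cat > "$demo/new_email_list.txt" <<'EOF'
x@company1.com
a@company2.com
b@company3.com
EOF

# -F: treat patterns as fixed strings, -x: match whole lines only,
# -f: read the patterns from the new list,
# -v: print the lines of the old list that match none of them.
grep -v -x -F -f "$demo/new_email_list.txt" "$demo/old_email_list.txt" \
    > "$demo/discarded.txt"

cat "$demo/discarded.txt"
```

Here discarded.txt ends up with y@company1.com and z@company2.com, in the order they had in the old list. Exact duplicate lines of a kept address would not show up as discarded, since they still match a line in the new list.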