I’m taking a couple of days off work and trying to get caught up on some stuff that I’d let sit during the new baby, the move, and the new job. One of those things is my email. I own my own domain, and I have an email address that acts as a catchall for anything sent to the domain. It’s great, because I can hand out somecompany@mydomain.com and find out who’s using my email for spam (most shocking? Ameritrade.)
Anyway, I have some server-side spam filtering that tags anything scored as medium or low spam with a special bit of text, which I use to route those messages into folders. I wanted to find out which email addresses were landing in those folders, to do a quick check that no legitimate mail was finding its way in there. I use Thunderbird as my email client for my home stuff on Linux, and each mail folder is stored as a separate plaintext file, making it nice and easy to search through.
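(If you want to try this at home, those folder files usually live under your Thunderbird profile directory; the path below is the typical Linux default, and your profile name will differ. The .msf files are just Thunderbird’s index files, not mail, so I skip them:)
foyc@dilbert ~ $ find ~/.thunderbird -path '*/Mail/*' -type f ! -name '*.msf'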
The first thing I had to do was find a regex to pull the info. Armed with that and grep, I pulled out the emails:
foyc@dilbert ~ $ grep -o '[A-Za-z0-9._%+-]*@mydomain\.com' MyMailFolder > emails.txt
That command finds every piece of text matching the specified regex. Normally grep prints the whole matching line; passing in the -o option returns only the matching portion.
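To see what -o does, here’s a quick demonstration (the addresses are made up):
foyc@dilbert ~ $ echo "To: joe@mydomain.com, offers@mydomain.com" | grep -o '[A-Za-z0-9._%+-]*@mydomain\.com'
joe@mydomain.com
offers@mydomain.com
Each match comes out on its own line, even when several appear on the same input line.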
Next, I have a big list of emails, so I need an easy way to remove the duplicates. That’s exactly what uniq is for. It works by removing adjacent duplicate lines, which means the input needs to be sorted first. Since my emails.txt file isn’t sorted, I’ll just use the sort command to do that on the fly:
foyc@dilbert ~ $ sort emails.txt | uniq > uniqueemails.txt
and voilà! A handy-dandy unique list of all the email addresses the spammers used in that file.
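In case you’re wondering why the sort is there: uniq only collapses duplicates that sit on adjacent lines, so on unsorted input it misses repeats. A quick made-up demonstration:
foyc@dilbert ~ $ printf 'a@mydomain.com\nb@mydomain.com\na@mydomain.com\n' | uniq
a@mydomain.com
b@mydomain.com
a@mydomain.com
Sort first and the duplicates line up. (You can also skip uniq entirely with sort -u emails.txt > uniqueemails.txt, which does both steps in one go.)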
Even better,
sort emails.txt | uniq -c | sort -rn > unique_emails_by_frequency.txt
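(For anyone following along: uniq -c prefixes each address with the number of times it appears, and sort -rn sorts those counts numerically in reverse, so the most prolific senders float to the top.)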
:)
I just can’t believe that you are using grep. Especially after giving me such a hard time about it. You make me sad. :)
Ian