Author: Gaurav Srivastava
What is email address harvesting?
Email harvesting is the process of obtaining lists of email addresses by
various methods, for use in bulk email or other purposes usually grouped under
the label of spam. On the web, spammers run programs which spider through web
pages looking for email addresses. This special software, known as "harvesting
bots", "harvesting robots", or "harvesters", crawls web pages and captures
every email address it finds. Email address harvesting is bad because once your
or your client's email address gets into spammers' lists, it will be flooded
with spam and trash very quickly. How does this relate to you? If you run,
design or build a web page, you need to take preventive steps to protect email
addresses from getting harvested. (If not, you could even be held liable.) And
if your email address is displayed somewhere, you do not want to have to create
a new email account soon just because it gets spammed.
Why would anyone want to harvest email addresses?
Spam. Phishing. Spoofing. Direct marketing.
All these techniques are used with one goal - to sell you something or to
conduct some illegal activity leading to getting some monetary or other benefit
from you. If someone with bad intentions has your email address, you can become
the target of those intentions. Harvesting email addresses can make money on
its own as well. Many spammers collect email address lists only to sell them to
marketing companies.
There are many ways in which spammers can get your
email address. The ones I know of are:
1. From posts to UseNet with your email address.
2. From mailing lists
3. From web pages.
4. From various web and paper forms.
5. Via an Ident daemon.
6. From a web browser.
7. From IRC and chat rooms.
8. From finger daemons.
9. AOL profiles
10. From domain contact points.
11. By guessing & cleaning.
12. From white & yellow pages.
13. By having access to the same computer.
14. From a previous owner of the email address.
15. Using social engineering.
16. Buying lists from others.
17. By hacking into sites.
1. From posts to UseNet with your email address.
Spammers regularly scan UseNet for email addresses, using ready-made programs
designed to do so. Some programs just look at article headers which contain
email addresses (From:, Reply-To:, etc.), while other programs check the
articles’ bodies. These range from programs that look at signatures, through
programs that take everything containing a ‘@’ character, to programs that
attempt to demunge munged email addresses.
There have been reports of spammers demunging email
addresses on occasions, ranging from demunging a single address for purposes of
revenge spamming to automatic methods that try to unmunge email addresses that
were munged in some common ways, e.g. remove such strings as ‘nospam’ from
email addresses.
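To show how little effort such automatic unmunging takes, here is a minimal
sketch in JavaScript. The function name and the patterns it handles ([AT],
[DOT] and the string "nospam") are assumptions chosen for illustration; real
harvesters may look for other patterns entirely.

```javascript
// Crude sketch of automatic demunging. The name demunge and the patterns
// below are illustrative assumptions, not taken from any real harvester.
function demunge(text) {
  return text
    // "[AT]", "(at)", " [ at ] " and similar become "@"
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, "@")
    // "[DOT]", "(dot)" and similar become "."
    .replace(/\s*[\[(]\s*dot\s*[\])]\s*/gi, ".")
    // drop "nospam" together with one adjacent separator on each side
    .replace(/[._-]?nospam[._-]?/gi, "");
}

console.log(demunge("foo[AT]example[DOT]com"));  // foo@example.com
console.log(demunge("john.nospam@example.com")); // john@example.com
```

Three substitutions are enough to undo the most common munging styles, which
is why a munged address should not be treated as a safe one.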
As people who were spammed frequently report that spam frequency to their
mailbox dropped sharply after a period in which they did not post to UseNet,
and as there is evidence of spammers chasing ‘fresh’ and ‘live’ addresses,
this technique seems to be the primary source of email addresses for spammers.
2. From mailing lists.
Spammers regularly attempt to get the lists of subscribers to mailing lists
[some mail servers will give those upon request], knowing that the email
addresses are unmunged and that only a few of the addresses are invalid.
When mail servers are configured to refuse such requests, another trick might
be used - spammers might send an email to the mailing list with the headers
Return-Receipt-To: or X-Confirm-Reading-To:. Those headers would cause some
mail transfer agents and reading programs to send email back to the sender
saying that the email was delivered to / read at a given email address,
divulging it to spammers.
A different technique used by spammers is to ask a mailing list server for the
list of all mailing lists it carries (an option implemented by some mailing
list servers for the convenience of legitimate users), and then send the spam
to each mailing list’s address, leaving the server to do the hard work of
forwarding a copy to each subscribed email address.
[I know spammers use this trick from bad experience - some spammer used this
trick on the list server of the company for which I work, easily covering most
of the employees, including employees who had been working there well under a
month and whose email addresses would be hard to find in other ways.]
3. From web pages.
Spammers have programs which spider through web pages, looking for email
addresses, e.g. email addresses contained in mailto: links [those you can
click on to get a mail window opened].
Some spammers even target their mail based on web pages. I discovered that a
web page of mine had appeared in Yahoo when a spammer, who harvested email
addresses from each new page appearing in Yahoo, sent me a spam regarding that
web page.
A widely used technique to fight this is the ‘poison’ CGI script. The script
creates a page with several bogus email addresses and a link to itself.
Spammers’ software visiting the page would harvest the bogus email addresses
and follow the link, entering an infinite loop that pollutes their lists with
bogus email addresses.
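The idea of such a poison page can be sketched in a few lines of JavaScript.
This is only an illustration under stated assumptions: the function names, the
/poison path and the use of the reserved .invalid top-level domain are my own
choices, not part of any particular script.

```javascript
// Hypothetical sketch of a "poison" page generator. All names here
// (randomWord, poisonPage, the /poison path) are illustrative assumptions.

// Build a random lowercase word of the given length.
function randomWord(length) {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  let word = "";
  for (let i = 0; i < length; i++) {
    word += letters[Math.floor(Math.random() * letters.length)];
  }
  return word;
}

// Return an HTML page holding `count` bogus addresses and a link back to
// the same script, so a crawler that follows links keeps fetching new junk.
function poisonPage(count, selfPath) {
  const lines = [];
  for (let i = 0; i < count; i++) {
    // .invalid is a reserved top-level domain, so no real mailbox is hit.
    const address = randomWord(8) + "@" + randomWord(10) + ".invalid";
    lines.push('<a href="mailto:' + address + '">' + address + "</a><br>");
  }
  // A changing query string makes every link look like a new page.
  const next = selfPath + "?page=" + randomWord(6);
  lines.push('<a href="' + next + '">More contacts</a>');
  return "<html><body>\n" + lines.join("\n") + "\n</body></html>";
}

console.log(poisonPage(3, "/poison"));
```

A server-side script would return the result of poisonPage for every request
to its own path. Well-behaved crawlers can be kept out by excluding that path
in robots.txt, so that only bots ignoring the exclusion fall into the loop.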
4. From various web and paper forms.
Some sites request various details via forms, e.g. guest books & registration
forms. Spammers can get email addresses from those either because the form
becomes available on the world wide web, or because the site sells / gives the
email list to others.
Some companies would sell / give email lists filled
in on paper forms, e.g. organizers of conventions would make a list of
participants’ email addresses, and sell it when it’s no longer needed.
Some spammers would actually type E-mail addresses
from printed material, e.g. professional directories & conference
proceedings.
Domain name registration forms are a favourite as well - addresses are most usually correct and updated, and people read the emails sent to them expecting important messages.
5. Via an Ident daemon.
Many unix computers run a daemon (a program which runs in the background,
initiated by the system administrator), intended to allow other computers to
identify people who connect to them.
When a person surfing from such a computer connects to a web site or news
server, the site or server can connect back to the person’s computer and ask
that daemon for the person’s email address.
Some chat clients on PCs behave similarly, so using IRC can cause an email
address to be given out to spammers.
6. From a web browser.
Some sites use various tricks to extract a surfer’s
email address from the web browser, sometimes without the surfer noticing it.
Those techniques include:
1. Making the browser fetch one of the page’s images through an anonymous FTP
connection to the site. Some browsers would give the email address the user
has configured into the browser as the password for the anonymous FTP account.
A surfer not aware of this technique will not notice that the email address
has leaked.
2. Using JavaScript to make the browser send an email to a chosen email
address, with the email address configured into the browser as the sender.
Some browsers would allow email to be sent when the mouse passes over some
part of a page. Unless the browser is properly configured, no warning will be
issued.
3. Using the HTTP_FROM header that browsers send to the server. Some browsers
pass a header with your email address to every web server you visit.
It’s worth noting here that when one reads E-mail
with a browser (or any mail reader that understands HTML), the reader should be
aware of active content (Java applets, Javascript, VB, etc) as well as web bugs.
An E-mail containing HTML may contain a script that
upon being read (or even the subject being highlighted) automatically sends
E-mail to any E-mail addresses. A good example of this case is the Melissa
virus. Such a script could send the spammer not only the reader’s E-mail
address but all the addresses on the reader’s address book.
7. From IRC and chat rooms.
Some IRC clients will give a user’s email address to
anyone who cares to ask it. Many spammers harvest email addresses from IRC,
knowing that those are ‘live’ addresses and send spam to those email addresses.
This method is used alongside the annoying IRC bots that send messages
interactively to IRC and chat rooms without attempting to recognize who is
participating in the first place.
This is another major source of email addresses for
spammers, especially as this is one of the first public activities newbies join,
making it easy for spammers to harvest ‘fresh’ addresses of people who might
have very little experience dealing with spam.
AOL chat rooms are the most popular of those -
according to reports there’s a utility that can get the screen names of participants
in AOL chat rooms. The utility is reported to be specialized for AOL due to two
main reasons - AOL makes the list of the actively participating users’ screen
names available and AOL users are considered prime targets by spammers due to
the reputation of AOL as being the ISP of choice by newbies.
8. From finger daemons.
Some finger daemons are set to be very friendly - a finger query asking for
john@host will produce information, including login names, for all people
named John on that host. A query for @host will produce a list of all
currently logged-on users.
Spammers use this information to get extensive user lists from hosts, and
lists of active accounts - ones which are ‘live’ and will read their mail soon
enough to be really attractive spam targets.
9. AOL, Google, Facebook, Twitter, RSS feeds profiles.
Spammers harvest AOL names from user profile lists, as it allows them to
‘target’ their mailing lists. Also, AOL has a reputation as the ISP of choice
for newbies, who might not know how to recognize scams or how to handle spam.
10. From domain contact points.
Every domain has one to three contact points - administration, technical, and
billing. The contact point includes the email address of the contact person.
As the contact points are freely available, e.g. using the ‘whois’ command,
spammers harvest the email addresses from the contact points for lists of
domains (the list of domains is usually made available to the public by the
domain registries). This is a tempting method for spammers, as those email
addresses are usually valid and mail sent to them is read regularly.
11. By guessing & cleaning.
Some spammers guess email addresses, then send a test message (or a real spam)
to a list which includes the guessed addresses. Then they wait for either an
error message to return by email, indicating that the email address is
incorrect, or for a confirmation. A confirmation could be solicited by
inserting non-standard but commonly used mail headers requesting that the
delivery system and/or mail client send a confirmation of delivery or reading.
No news is, of course, good news for the spammer.
Specifically, the headers are:
Return-Receipt-To: Send a delivery confirmation
X-Confirm-Reading-To: Send a reading confirmation
Guessing could be done based on the fact that email addresses are based on
people’s names, usually in commonly used ways (first.last@domain or an initial
of one name followed / preceded by the other @domain).
Also, some email addresses are standard - postmaster is mandated by the RFCs
for internet mail. Other common email addresses are hostmaster, root [for unix
hosts], etc.
12. From white & yellow pages.
There are various sites that serve as white pages, sometimes named people
finder web sites. Yellow pages now have an email directory on the web.
Those white/yellow pages contain addresses from various sources, e.g. from
UseNet, but sometimes your E-mail address will be registered for you. For
example, HotMail will add E-mail addresses to BigFoot by default, making new
addresses available to the public.
Spammers go through those directories in order to get email addresses. Most
directories prohibit email address harvesting by spammers, but as those
directories hold large databases of email addresses + names, they are a
tempting target for spammers.
13. By having access to the same computer.
If a spammer has access to a computer, he can usually get a list of valid
usernames (and therefore email addresses) on that computer.
On unix computers the users file (/etc/passwd) is commonly world readable, and
the list of currently logged-in users is listed via the ‘who’ command.
14. From a previous owner of the email address.
An email address might have been owned by someone else, who disposed of it.
This might happen with dialup usernames at ISPs - somebody signs up for an
ISP, has his/her email address harvested by spammers, and cancels the account.
When somebody else signs up with the same ISP with the same username, spammers
already know of it.
Similar things can happen with AOL screen names - somebody uses a screen name,
gets tired of it, releases it. Later on somebody else might take the same
screen name.
15. Using social engineering.
This method means the spammer uses a hoax to talk people into giving him valid
E-mail addresses.
A good example is Richard Douche’s “Free CD’s” chain letter. The letter
promises a free CD for every person to whom the letter is forwarded, as long
as it is CC’ed to Richard.
Richard claimed to be associated with Amazon and
Music blvd, among other companies, who authorized him to make this offer. Yet
he supplied no references to web pages and used a free E-mail address.
All Richard wanted was to get people to send him
valid E-mail addresses in order to build a list of addresses to spam and/or
sell.
16. Buying lists from others.
This one covers two types of trades. The first type consists of buying a list
of email addresses (often on CD) that were harvested via other methods, e.g.
someone harvests email addresses from UseNet and sells the list either to a
company that wishes to advertise via email (sometimes passing off the list as
that of people who opted in for emailed advertisements) or to others who
resell the list.
The second type consists of a company that got the email addresses
legitimately (e.g. a magazine that asks subscribers for their email in order
to keep in touch over the Internet) and sells the list for the extra income.
This extends to selling email addresses a company got via other means, e.g.
from people who just emailed the company with inquiries in any context.
17. By hacking into sites.
I’ve heard rumours that sites that supply free email
addresses were hacked in order to get the list of email addresses, somewhat
like e-commerce sites being hacked to get a list of credit cards.
How to protect web pages from email harvesting - security tips
We already know that email address
harvesting is not good and that spammers have software that searches through
the web and looks for email addresses. Now let's take a look at how to protect
web pages from email harvesting. The following is a list of methods to hide
email addresses from the page source to minimize visibility against the email
harvesting spam bots. Each method has its advantages and disadvantages, so it
is up to you to decide which method suits your needs the most.
1. Plain HTML code
2. HTML comments
3. Unicode characters, hexadecimal or decimal entities.
4. Email address or its parts displayed as images.
5. Email address in HTTP redirect.
6. Email address and mailto as JavaScript.
7. Email address via CSS2 pseudo-element :after.
8. Email address through CSS2 unicode-bidi (text direction).
9. Stuff email address with CSS display: none.
10. Use forms for emails
Plain HTML code
First, let's explain how email addresses are usually displayed at websites and
then start with the easy stuff. Email addresses are often coded into web pages
like the following example:
<a href="mailto:foo@example.com">foo@example.com</a>
This example produces a clickable foo@example.com. If you click this email
address, your mail client (e.g. Outlook) will open up with this email address
in the To: field. This email format is a beauty for email harvesting software.
It is exactly what harvesters are looking for and where they get the majority
of email addresses.
Some people make the job harder for email address harvesting software by
writing out the email address as shown in the following two examples.
<a href="mailto:foo@example.com">foo[AT]example[DOT]com</a>
and
foo[AT]example[DOT]com
This is a bit better than the plain HTML format, but notice that the first
example still includes your correct email address in the mailto field, so
email harvesting software can still find you. The second option leaves out the
A HREF tag, so the link will not be clickable anymore and the visitor will
have to copy your email address and paste it into his or her email client.
Substituting @ with [AT] and dot with [DOT] is a nice idea, but there is
nothing easier than telling the email harvesting software
"if you find [AT], replace it with @".
Fake email address or switched domains
A good way to protect your email address in a web page is to fake it for the
email harvesting robot and let the human know that it has been faked.
<a href="mailto:foo@example[REMOVETHIS].com">foo@example[REMOVETHIS].com</a>
or
<a href="mailto:foo@com.example">foo@com.example</a>
These examples are not bad, but you really have to let the visitor know that
he or she needs to fix the email address before sending email to it. Many
people just blindly click, copy, and paste, so you have to make this visible
(perhaps by displaying the [REMOVETHIS] in red or formatting it with a
strikethrough line). This email harvesting protection technique works well
against email harvesting bots because even though they get the email, it is an
invalid one, hence you are safe. On the other hand, emails in this format may
confuse the user if the idea is not described well.
HTML comments
You can also protect web pages from email harvesting by separating the
individual parts of the email address with HTML comments.
foo<!-- >@. -->@<!-- >@. -->example<!-- >@. -->.<!-- >@. -->com
This would be displayed as foo@example.com. Placing the @, ., and > symbols
inside the comments makes it a little more difficult for the email harvesting
software to harvest emails from your web page. Unfortunately, the drawback is
that the address is not a clickable link, so the user cannot open his or her
mail client from it.
Unicode characters, hexadecimal or decimal
entities
Another way to protect web pages from email harvesting is to encode the email
address in a form that the computer can understand, but only with some
additional work.
<a href="mailto:&#102;&#111;&#111;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#99;&#111;&#109;">&#102;&#111;&#111;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#99;&#111;&#109;</a>
The gibberish code above is the same foo@example.com email address as in the
Plain HTML code section, just written in a different notation (decimal
entities). Even though this code is not readable to a human, a browser or
email client displays it exactly like the plain mailto:foo@example.com link.
Here is a page that tells you how this can be done in PHP: PHP loop through
string.
If you want to know how the gibberish code translates to readable letters,
take a look at the ASCII table (dec 102 = char f). Our ASCII to hex converter
and dec to hex converter tools can help you when setting this up.
Not a bad idea; however, from an email address harvesting robot's perspective
this is similar to the methods above. The robot can just as easily interpret
the character entities. But not every email address harvesting robot is
programmed to do this conversion. If you combine a mix of unicode characters,
decimal and hexadecimal entities, you will be another step ahead.
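Writing the entities by hand is tedious, so here is a minimal JavaScript
sketch that produces a mix of decimal and hexadecimal entities for any
address. The function name toMixedEntities and the simple alternating pattern
are assumptions made for this illustration; any other mix would work as well.

```javascript
// Encode every character of `text` as an HTML numeric entity, alternating
// between decimal (&#102;) and hexadecimal (&#x66;) forms.
// The function name is an illustrative assumption.
function toMixedEntities(text) {
  let out = "";
  // Array.from splits by code point, so unusual characters survive intact.
  Array.from(text).forEach((ch, i) => {
    const code = ch.codePointAt(0);
    out += i % 2 === 0 ? "&#" + code + ";" : "&#x" + code.toString(16) + ";";
  });
  return out;
}

// Build the markup to paste into a page.
const encoded = toMixedEntities("foo@example.com");
console.log('<a href="mailto:' + encoded + '">' + encoded + "</a>");
```

Run it once, for example with Node.js, and paste the printed markup into the
page. The encoding is meant to be done ahead of time, not in the visitor's
browser.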
Email address or its parts displayed as images
Another way to protect web pages from email harvesting is to use a small image
that contains either the full email address or its parts. Even though
obtaining information from an image is possible, only a few email harvesting
programs are capable of doing this. Obtaining your email address from an image
is resource-costly and, for email address harvesters, not worth the effort.
foo [image of @] example [image of .] com
This makes the address unreadable to email address harvesting robots but still
semi-readable to visually impaired humans.
Email address in HTTP redirect
One way to prevent email address harvesting is to write a server-side script
that returns the mailto:foo@example.com link as an HTTP redirect. All modern
browsers recognize mailto in the Location header, but not every harvester is
capable of understanding this. Here is an example showing how this can be done
in PHP. You display your email as, for example:
<a href="email_address.php">This is my email address, click here.</a>
The content of the email_address.php file is the following:
<?php
// Redirect the browser to the mailto: link; nothing may be output before this.
header("Location: mailto:foo@example.com");
exit();
?>
Remember that your web server must be set up to execute PHP for this to work.
No URL rewriting module such as Apache's mod_rewrite is needed for a redirect
sent with header(). When the visitor clicks the link, the browser requests the
email_address.php file, which redirects it to mailto:foo@example.com and
thereby opens the visitor's mail client.
Email address and mailto as JavaScript.
This is another common technique to prevent email address harvesting. Instead
of using the plain <a href=mailto:address></a> HTML tag, you write out the
same thing using JavaScript. There are numerous ways of doing this in
JavaScript; however, the concept is the same. The idea is to break the email
address into parts which cannot be easily parsed from the source code by the
email address harvesting program. And the beauty of JavaScript is that this
can be done in many ways. The easiest script would be:
<script type="text/javascript">
document.write("foo@example.com")
</script>
This script displays the email address in the browser. The email address is
not clickable, and the nice thing is that it does not include the mailto
prefix, which is what email address harvesting programs are looking for. The
bad thing is that the email address is still visible and can be easily parsed
if the spam bot is set up to look for the @ symbol. The next step in our
security cook-book is to split the email address into pieces.
<script type="text/javascript">
document.write("foo" + "&#64;" + "example" + "." + "com")
</script>
In this example, the email harvester would need to be smart enough to join the
individual strings and also to translate the &#64; entity into the @ symbol.
(Note, how did we come up with the &#64;? Take a look at the ASCII table and
try our ASCII to hex converter.)
If you are still worried that the
email harvester might strip out the " + " and reassemble the email
link, there is always something more you can do. Variables can have information
added to their existing contents and that new content can even include the
existing variable. If you wanted to make the code more complicated, you could
use something like the following:
<script type="text/javascript">
var string1 = "foo";
var string2 = "@";
var string3 = "example";
var string4 = ".";
var string5 = "com";
var string6 = string1 + string2 + string3 +
string4 + string5;
document.write("<a href=" +
"mail" + "to:" + string1 + string2 + string3 + string4 +
string5 + ">" + string6 + "</a>");
</script>
This looks pretty challenging for the email
harvesting program, does it not? You can even combine this method with email
faking and using images. Here is another example of what can be done in
JavaScript.
<a
href='javascript:window.location="mail"+"to:"+"foo"+"@"+"example"+"."+"com";'
onmouseover='window.status="mail"+"to:"+"foo"+"@"+"example"+"."+"com";
return true;'
onmouseout='window.status="";return
true;'>Click here to send mail.</a>
The drawback of the JavaScript method is that the email address is visible on
screen only in browsers which support JavaScript. Browsers that have
JavaScript turned off or do not support JavaScript would not display the email
address at all.
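One possible variation, sketched below, avoids document.write and joins the
pieces only when the visitor clicks. The data-user and data-domain attribute
names and the buildMailto helper are assumptions invented for this sketch.
For visitors without JavaScript the link text can still carry a human-readable
hint such as "foo at example dot com".

```javascript
// Join address pieces at run time. Keeping the pieces in data attributes
// (data-user, data-domain) is an assumption of this sketch.
function buildMailto(user, domain) {
  return "mail" + "to:" + user + "@" + domain;
}

// Example markup this script expects (hypothetical):
//   <a href="#" data-user="foo" data-domain="example.com">Send mail</a>

// In a browser, attach a click handler to every element that carries both
// attributes. The guard lets the file load outside a browser without errors.
if (typeof document !== "undefined") {
  document.querySelectorAll("[data-user][data-domain]").forEach((el) => {
    el.addEventListener("click", (event) => {
      event.preventDefault();
      window.location.href = buildMailto(el.dataset.user, el.dataset.domain);
    });
  });
}
```

The script has to run after the links exist in the page, for example from a
script tag placed at the end of the body.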
Email address via CSS2 pseudo-element :after
Here is another great technique that you can use to prevent email address
harvesting. First, you would define your CSS code:
.emailDiv:after { content: "foo@example.com"; }
The class can be defined either in the HEAD of your page or in your *.css
file. You can of course substitute the @ symbol with a CSS escape such as
\000040 (HTML entities are not interpreted inside CSS). Once you have your CSS
defined, you would display your email address as follows:
<div class="emailDiv">This is my email address: </div>
This code would be displayed in the browser as This is my email address:
foo@example.com. The dark side of this technique is that only browsers that
can interpret CSS2 will display the address. MSIE as of the end of 2008 does
not display this; Firefox works.
Email address through CSS2 unicode-bidi (text direction)
Another technique to prevent email address harvesting is based on changing the
direction of the text. The key in this email address harvesting prevention
method is to change the direction of text from left-to-right (default) to
right-to-left. First, you would define your CSS code:
div.codedirection { unicode-bidi: bidi-override; direction: rtl; }
and then you would display the email address on your page as
<div class="codedirection">moc.elpmaxe@oof</div>
The browser will display the email address as foo@example.com. The nice
feature of this method is that the page source contains only the reversed
string, while the rule that restores the reading order can sit in your *.css
file, separately from your HTML code. Browsers without CSS2 support will
display the email backwards, which could be quite bothersome for the visitor
to invert.
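Reversing the address by hand is error-prone, so a tiny helper can do it. This
is a minimal sketch; the function name reverseForBidi is an assumption of
mine.

```javascript
// Reverse a string so it can be placed inside the right-to-left element.
// Array.from keeps multi-unit characters intact; the name is an assumption.
function reverseForBidi(address) {
  return Array.from(address).reverse().join("");
}

console.log(reverseForBidi("foo@example.com")); // moc.elpmaxe@oof
```

The printed string is what goes into the page source. The CSS rule then flips
it back for the visitor.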
Stuff email address with CSS display:none
Display none is another nice technique to prevent email address harvesting. In
this case, we just interlace the email address with some text that we later
remove from the displayed address with display none when rendering it in the
browser. First, you would define your CSS:
.hideThisText { display:none; }
Now you would use this CSS class in your text.
foo@example<span class="hideThisText">[REMOVETHIS]</span>.com
The browser would display the email address as foo@example.com. Browsers that
support the display:none property (most browsers do) will not display the
[REMOVETHIS] to the user. Those browsers that do not support display:none will
show the [REMOVETHIS] and the user will hopefully remove it before sending
email to the address. The email is textually available to the user; however,
the user cannot click a link in order to open their email client.
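The stuffed markup can also be generated rather than typed by hand. The sketch
below is only an illustration: the function name stuffAddress is an
assumption, and it wraps the decoy in a span (an inline element) so the markup
stays valid inside a paragraph.

```javascript
// Insert a hidden decoy before the last dot of the address. The class name
// hideThisText matches the CSS rule above; the function name is an assumption.
function stuffAddress(address, decoy) {
  const lastDot = address.lastIndexOf(".");
  if (lastDot === -1) return address; // nothing to split on
  return (
    address.slice(0, lastDot) +
    '<span class="hideThisText">' + decoy + "</span>" +
    address.slice(lastDot)
  );
}

console.log(stuffAddress("foo@example.com", "[REMOVETHIS]"));
```

A harvester reading the raw source sees foo@example[REMOVETHIS].com once the
tags are stripped, while the visitor sees the clean address.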
Use forms for emails
If you want to completely prevent email address harvesting, using forms is the
best option. In this case, no email address is displayed at the website. The
user has to contact you by filling out a form, and a server-side script
forwards the data from the form to your email. Your email address is very
safe. Spam robots simply pass this area as it contains no email address in the
source code.
The disadvantage of this method is that your form can get spammed by content
spammers, but that is another story. You would need to protect your form with
a CAPTCHA.
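The server-side part can be sketched as follows. This is not a complete mail
handler, only an illustration of the key point that the recipient address
lives on the server and never reaches the page. The field names (name, email,
message), the RECIPIENT constant and the buildMessage function are all
assumptions; actually sending the message is left to whatever mail facility
your server offers.

```javascript
// The recipient lives only on the server; it never appears in page source.
// RECIPIENT and the field names are assumptions made for this sketch.
const RECIPIENT = "foo@example.com";

// Turn submitted form fields into a message object, or return null when
// the submission should be rejected.
function buildMessage(fields) {
  const name = String(fields.name || "").trim();
  const replyTo = String(fields.email || "").trim();
  const body = String(fields.message || "").trim();
  // Line breaks in header values could be used to smuggle in extra headers.
  if (/[\r\n]/.test(name) || /[\r\n]/.test(replyTo)) return null;
  // Require something that looks like an address and a non-empty message.
  if (!replyTo.includes("@") || body === "") return null;
  return { to: RECIPIENT, replyTo, subject: "Contact form: " + name, text: body };
}

console.log(buildMessage({ name: "Ann", email: "ann@example.org", message: "Hello" }));
```

A CAPTCHA check would sit in front of buildMessage, so that automated
submissions are rejected before any message is built.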
Which method is the best?
It depends. First of all, please note that there are many variations to the
above methods, and they can be combined to produce a unique solution. Whatever
you do at your website, please remember to fully test your code in all major
browsers and their versions too.
Remember that some methods are limited in their accessibility. Web pages can
be accessed not only through MSIE, Firefox, Opera, and Netscape on an x86
based PC, but also on Macs, through Linux and Unix, and you might have some
visitors using cell phones and other hand-held devices too. Visually impaired
people use text readers.
When choosing the right method for your application, it does not necessarily
need to be the one that is most complicated. Theoretically, email harvesters
could write code that can break or decode every method listed here. If
something can be engineered, it can always be reverse-engineered. However,
consider the size of the source code that a harvester would need in order to
account for every method listed here, and multiply that by the number of
sites/pages a bot has to go through in order to collect a good number of
emails. Accounting for every method listed here would call for extreme
resources. So, most email address harvesting programs are kept very simple and
primitive in order to be small and fast. With minimal measures, a large
portion of harvesters can be fooled.