Thursday, 19 July 2012

Email Harvesting-How to protect E-mail


Author: Gaurav Srivastava

What is email address harvesting?

Email harvesting is the process of obtaining lists of email addresses by various methods for use in bulk email or for other purposes usually grouped together as spam. On the web, spammers run programs that spider through web pages looking for email addresses. This special software, known as "harvesting bots", "harvesting robots", or "harvesters", crawls web pages and captures every email address it finds. Email address harvesting is bad because once your or your client's email address gets into spammers' lists, it will be flooded with spam very quickly. How does this relate to you? If you run, design or build a web page, you need to take preventive steps to protect email addresses from being harvested. (If you do not, you could even be held liable.) And if your email address is displayed somewhere, you do not want to have to create a new email account just because the old one gets spammed.

Why would anyone want to harvest email addresses?

Spam. Phishing. Spoofing. Direct marketing. All these techniques are used with one goal - to sell you something or to conduct some illegal activity that brings a monetary or other benefit at your expense. If someone with bad intentions has your email address, you can become a target of those intentions. Harvesting email addresses alone can make money as well. Many spammers collect email address lists only to sell them to marketing companies.

There are many ways in which spammers can get your email address. The ones I know of are:

1.     From posts to UseNet with your email address.

2.     From mailing lists

3.     From web pages.

4.     From various web and paper forms.

5.     Via an Ident daemon.

6.     From a web browser.

7.     From IRC and chat rooms.

8.     From finger daemons.

9.     AOL profiles

10.  From domain contact points.

11.  By guessing & cleaning. 

12.  From white & yellow pages.

13.  By having access to the same computer.

14.  From a previous owner of the email address.

15.  Using social engineering.

16.  Buying lists from others.

17.  By hacking into sites.

 

            1. From posts to UseNet with your email address.


Spammers regularly scan UseNet for email addresses, using ready-made programs designed to do so. Some programs just look at article headers which contain email addresses (From:, Reply-To:, etc.), while other programs check the articles’ bodies, ranging from programs that look at signatures to programs that take everything that contains a ‘@’ character and attempt to demunge munged email addresses.
There have been reports of spammers demunging email addresses on occasion, ranging from demunging a single address for purposes of revenge spamming to automatic methods that try to unmunge email addresses that were munged in some common way, e.g. by removing such strings as ‘nospam’ from email addresses.
As people who were spammed frequently report that the frequency of spam to their mailbox dropped sharply after a period in which they did not post to UseNet, and as there is evidence of spammers’ chase after ‘fresh’ and ‘live’ addresses, this technique seems to be the primary source of email addresses for spammers.

            2. From mailing lists.


Spammers regularly attempt to get the lists of subscribers to mailing lists [some mail servers will give those upon request], knowing that the email addresses are unmunged and that only a few of the addresses are invalid.
When mail servers are configured to refuse such requests, another trick might be used - spammers might send an email to the mailing list with the headers Return-Receipt-To: or X-Confirm-Reading-To: . Those headers would cause some mail transfer agents and reading programs to send email back to the sender saying that the email was delivered to / read at a given email address, divulging it to spammers.
A different technique used by spammers is to ask a mailing list server for the list of all mailing lists it carries (an option implemented by some mailing list servers for the convenience of legitimate users), and then send the spam to each mailing list’s address, leaving the server to do the hard work of forwarding a copy to each subscribed email address.
[I know spammers use this trick from bad experience - some spammer used this trick on the list server of the company for which I work, easily covering most of the employees, including employees who had been working there well under a month and whose email addresses would be hard to find in other ways.]

           3. From web pages.


Spammers have programs which spider through web pages, looking for email addresses, e.g. email addresses contained in mailto: HTML links [those you can click on and get a mail window opened].
Some spammers even target their mail based on web pages. I discovered that a web page of mine had appeared in Yahoo when a spammer, who harvested email addresses from each new page appearing in Yahoo, sent me spam regarding that web page.
A widely used technique to fight this is the ‘poison’ CGI script. The script creates a page with several bogus email addresses and a link to itself. Spammers’ software visiting the page would harvest the bogus email addresses and follow the link, entering an infinite loop that pollutes their lists with bogus email addresses.
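The ‘poison’ script idea can be sketched roughly as follows. This is only an illustration, not a known production script: the function names, the poison?page= URL and the use of the reserved .invalid top-level domain (chosen so that bogus addresses can never belong to a real person) are all assumptions made for this example.

```javascript
// Sketch of a 'poison' page generator. All names here are hypothetical.

// Small deterministic generator so the same page id always yields the same page.
function makeRng(seed) {
  let state = seed >>> 0;
  return function next() {
    // Linear congruential step; keep the result as an unsigned 32-bit value.
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state;
  };
}

// Build a lowercase pseudo-random word of the given length.
function randomWord(rng, length) {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  let word = "";
  for (let i = 0; i < length; i++) {
    word += letters[(rng() >>> 16) % letters.length];
  }
  return word;
}

// Return an HTML page with bogus mailto links plus a link to the next poison page.
function poisonPage(pageId, addressCount) {
  const rng = makeRng(pageId);
  const lines = [];
  for (let i = 0; i < addressCount; i++) {
    // Assumption: .invalid is used so no real mailbox is ever hit.
    const address = randomWord(rng, 8) + "@" + randomWord(rng, 6) + ".invalid";
    lines.push('<a href="mailto:' + address + '">' + address + "</a><br>");
  }
  // The link leads back into the same script, so a bot keeps crawling forever.
  lines.push('<a href="poison?page=' + (pageId + 1) + '">More contacts</a>');
  return "<html><body>\n" + lines.join("\n") + "\n</body></html>";
}
```

A server-side handler would call poisonPage with the page number from the query string. Well-behaved crawlers should be kept away from such pages, for example through robots.txt.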

            4. From various web and paper forms.

Some sites request various details via forms, e.g. guest books & registration forms. Spammers can get email addresses from those either because the form becomes available on the world wide web, or because the site sells / gives the email list to others.
Some companies sell / give away email lists filled in on paper forms, e.g. organizers of conventions make a list of participants’ email addresses, and sell it when it’s no longer needed.
Some spammers actually type in E-mail addresses from printed material, e.g. professional directories & conference proceedings.
Domain name registration forms are a favourite as well - addresses are usually correct and up to date, and people read the emails sent to them expecting important messages.

            5. Via an Ident daemon.

Many unix computers run a daemon (a program which runs in the background, initiated by the system administrator) intended to allow other computers to identify people who connect to them.
When a person surfing from such a computer connects to a web site or news server, the site or server can connect back to the person’s computer and ask that daemon for the person’s email address.
Some chat clients on PCs behave similarly, so using IRC can cause an email address to be given out to spammers.

            6. From a web browser.

Some sites use various tricks to extract a surfer’s email address from the web browser, sometimes without the surfer noticing it.
Those techniques include:
1.     Making the browser fetch one of the page’s images through an anonymous FTP connection to the site. Some browsers would give the email address the user has configured into the browser as the password for the anonymous FTP account. A surfer not aware of this technique will not notice that the email address has leaked.
2.     Using JavaScript to make the browser send an email to a chosen email address with the email address configured into the browser. Some browsers would allow email to be sent when the mouse passes over some part of a page. Unless the browser is properly configured, no warning will be issued.
3.     Using the From request header (seen by server-side scripts as HTTP_FROM) that some browsers send to the server. Such browsers pass a header with your email address to every web server you visit.

It’s worth noting here that when one reads E-mail with a browser (or any mail reader that understands HTML), the reader should be aware of active content (Java applets, JavaScript, VB, etc.) as well as web bugs.
An E-mail containing HTML may contain a script that, upon the message being read (or even its subject being highlighted), automatically sends E-mail to any E-mail address the script chooses. A well-known example of self-mailing code is the Melissa virus, which was carried in a Word macro rather than in HTML and mailed itself to addresses from the reader’s address book. Such a script could send the spammer not only the reader’s E-mail address but all the addresses in the reader’s address book.

            7. From IRC and chat rooms.


Some IRC clients will give a user’s email address to anyone who cares to ask for it. Many spammers harvest email addresses from IRC, knowing that those are ‘live’ addresses, and send spam to those email addresses.
This method is used alongside the annoying IRC bots that send messages interactively to IRC and chat rooms without attempting to recognize who is participating in the first place.
This is another major source of email addresses for spammers, especially as this is one of the first public activities newbies join, making it easy for spammers to harvest ‘fresh’ addresses of people who might have very little experience dealing with spam.
AOL chat rooms are the most popular of those - according to reports there’s a utility that can get the screen names of participants in AOL chat rooms. The utility is reported to be specialized for AOL due to two main reasons - AOL makes the list of the actively participating users’ screen names available and AOL users are considered prime targets by spammers due to the reputation of AOL as being the ISP of choice by newbies.

            8. From finger daemons.

Some finger daemons are set to be very friendly - a finger query asking for john@host will produce list info including login names for all people named John on that host. A query for @host will produce a list of all currently logged-on users.
Spammers use this information to get extensive user lists from hosts, and lists of active accounts - ones which are ‘live’ and will read their mail soon enough to be really attractive spam targets.

            9. AOL, Google, Facebook, Twitter, RSS feed profiles.

Spammers harvest AOL names from user profile lists, as it allows them to ‘target’ their mailing lists. Also, AOL has a reputation as the ISP of choice of newbies, who might not know how to recognize scams or how to handle spam.

           10. From domain contact points.


Every domain has one to three contact points - administration, technical, and billing. The contact point includes the email address of the contact person.
As the contact points are freely available, e.g. using the ‘whois’ command, spammers harvest the email addresses from the contact points for lists of domains (the list of domains is usually made available to the public by the domain registries). This is a tempting method for spammers, as those email addresses are usually valid and mail sent to them is read regularly.

            11. By guessing & cleaning.

Some spammers guess email addresses and send a test message (or a real spam) to a list which includes the guessed addresses. Then they wait either for an error message to return by email, indicating that the email address is incorrect, or for a confirmation. A confirmation can be solicited by inserting non-standard but commonly used mail headers requesting that the delivery system and/or mail client send a confirmation of delivery or reading. No news is, of course, good news for the spammer.

Specifically, the headers are –
Return-Receipt-To: Send a delivery confirmation
X-Confirm-Reading-To: Send a reading confirmation

Guessing can be done based on the fact that email addresses are based on people’s names, usually in commonly used ways (first.last@domain, or an initial of one name followed / preceded by the other name @domain).
Also, some email addresses are standard - postmaster is mandated by the RFCs for internet mail. Other common email addresses are hostmaster, root [for unix hosts], etc.

           12. From white & yellow pages.


There are various sites that serve as white pages, sometimes named people finders web sites. Yellow pages now have an email directory on the web.
Those white/yellow pages contain addresses from various sources, e.g. from UseNet, but sometimes your E-mail address will be registered for you. Example - HotMail will add E-mail addresses to BigFoot by default, making new addresses available to the public.
Spammers go through those directories in order to get email addresses. Most directories prohibit email address harvesting by spammers, but as those sites hold large databases of email addresses + names, they are a tempting target for spammers.

            13. By having access to the same computer.

If a spammer has access to a computer, he can usually get a list of valid usernames (and therefore email addresses) on that computer.
On unix computers the users file (/etc/passwd) is commonly world readable, and the list of currently logged-in users is listed via the ‘who’ command.

           14. From a previous owner of the email address.

An email address might have been owned by someone else, who disposed of it. This might happen with dialup usernames at ISPs - somebody signs up for an ISP, has his/her email address harvested by spammers, and cancels the account. When somebody else signs up with the same ISP with the same username, spammers already know of it.
Similar things can happen with AOL screen names - somebody uses a screen name, gets tired of it, releases it. Later on somebody else might take the same screen name.

           15. Using social engineering.

This method means the spammer uses a hoax to trick people into giving him valid E-mail addresses.
A good example is Richard Douche’s “Free CD’s” chain letter. The letter promises a free CD for every person to whom the letter is forwarded, as long as it is CC’ed to Richard.
Richard claimed to be associated with Amazon and Music blvd, among other companies, who authorized him to make this offer. Yet he supplied no references to web pages and used a free E-mail address.
All Richard wanted was to get people to send him valid E-mail addresses in order to build a list of addresses to spam and/or sell.

           16. Buying lists from others.

This one covers two types of trades. The first type consists of buying a list of email addresses (often on CD) that were harvested via other methods, e.g. someone harvests email addresses from UseNet and sells the list either to a company that wishes to advertise via email (sometimes passing off the list as that of people who opted in for emailed advertisements) or to others who resell the list.
The second type consists of a company who got the email addresses legitimately (e.g. a magazine that asks subscribers for their email in order to keep in touch over the Internet) and sells the list for the extra income. This extends to selling of email addresses a company got via other means, e.g. people who just emailed the company with inquiries in any context.

            17. By hacking into sites.

I’ve heard rumours that sites that supply free email addresses were hacked in order to get the list of email addresses, somewhat like e-commerce sites being hacked to get a list of credit cards.

  
How to protect web pages from email harvesting - security tips

We already know that email address harvesting is not good and that spammers have software that searches through the web and looks for email addresses. Now let's take a look at how to protect web pages from email harvesting. The following is a list of methods to hide email addresses from the page source to minimize visibility against the email harvesting spam bots. Each method has its advantages and disadvantages, so it is up to you to decide which method suits your needs the most.
1.     Plain HTML code
2.     HTML comments
3.     Unicode characters, hexadecimal or decimal entities.
4.     Email address or its parts displayed as images.
5.     Email address in HTTP redirect.
6.     Email address and mailto as JavaScript.
7.     Email address via CSS2 pseudo-element :after.
8.     Email address through CSS2 unicode-bidi (text direction).
9.     Stuff email address with CSS display: none.
10.  Use forms for emails

Plain HTML code

First, let's explain how email addresses are usually displayed at websites and then start with the easy stuff. Email addresses are often coded into web pages like the following example:
<a href="mailto:foo@example.com">foo@example.com</a>
This example produces a clickable foo@example.com. If you click this email address, your mail client (e.g. Outlook) will open up with this email address in the To: field. This email format is a beauty for email harvesting software; it is exactly what harvesters are looking for and where they get the majority of email addresses.
Some people make the job harder for email address harvesting software by writing out the email address as shown in the following two examples.
<a href="mailto:foo@example.com">foo[AT]example[DOT]com</a> and foo[AT]example[DOT]com
This is a bit better than the plain HTML format but notice that the first example still includes your correct email address in the mailto field, so email harvesting software still can find you. The second option leaves out the A HREF tag, so the link will not be clickable anymore and the visitor will have to copy your email address and paste it into his or her email client. Substituting @ with [AT] and dot with [DOT] is a nice idea but there is nothing easier than telling the email harvesting software "if you find [AT], replace it with @".
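To see how little the [AT] / [DOT] substitution buys you, here is a rough sketch of the kind of normalization a harvester could apply; it can also be used to test how well your own page's obfuscation holds up. The function name and the exact patterns are assumptions made for this illustration, not taken from any real harvesting tool.

```javascript
// Hypothetical check: undo common [AT] / [DOT] substitutions, then look for addresses.
function findMungedAddresses(text) {
  const normalized = text
    // "[AT]", "(at)", " [ at ] " and similar all become "@".
    .replace(/\s*[\[(]\s*at\s*[\])]\s*/gi, "@")
    // "[DOT]", "(dot)" and similar all become ".".
    .replace(/\s*[\[(]\s*dot\s*[\])]\s*/gi, ".");
  // Deliberately simple address pattern; real addresses can be more exotic.
  const pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;
  return normalized.match(pattern) || [];
}

console.log(findMungedAddresses("Write to foo[AT]example[DOT]com today"));
// → [ 'foo@example.com' ]
```

Two regular expressions are enough to recover the address, which is why this substitution on its own should not be relied upon.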

Fake email address or switched domains

A good way to protect your email address in a web page is to fake it for the email harvesting robot and let the human know that it has been faked.
<a href="mailto:foo@example[REMOVETHIS].com">foo@example[REMOVETHIS].com</a> or
<a href="mailto:foo@com.example">foo@com.example</a>
These examples are not bad, but you really have to let the visitor know that he or she needs to fix the email address before sending email to it. Many people just blindly click, copy, and paste, so you have to make this visible (perhaps by displaying the [REMOVETHIS] in red or formatting it with a strikethrough line). This email harvesting protection technique works well against email harvesting bots because even though they get the email address, it is an invalid one, hence you are safe. On the other hand, email addresses in this format may confuse the user if the idea is not explained well.

HTML comments

You can also protect web pages from email harvesting by enclosing individual email address parts with HTML comments.
foo<!-- >@. -->@<!-- >@. -->example<!-- >@. -->.<!-- >@. -->com
This would be displayed as foo@example.com. Placing the @, ., and > symbols inside the comment makes it a little more difficult for the email harvesting software to harvest emails from your web page. Unfortunately, the drawback is that a user initiated mail client cannot be brought up with this method.
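If you have many addresses to protect, the comment markup can be generated rather than typed by hand. The following is a minimal sketch; both function names are made up for this example, and visibleText only approximates what a browser shows by stripping comments.

```javascript
// Hypothetical helper: put a decoy HTML comment between the parts of an address.
function interleaveComments(address, decoy) {
  if (decoy.includes("--")) {
    // "--" inside the decoy could end the comment early and break the page.
    throw new Error("decoy must not contain --");
  }
  const comment = "<!-- " + decoy + " -->";
  // Split around "@" and "." while keeping those separators as separate parts.
  const parts = address.split(/([@.])/).filter((part) => part !== "");
  return parts.join(comment);
}

// Rough stand-in for what the visitor sees: the markup with comments removed.
function visibleText(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "");
}

console.log(interleaveComments("foo@example.com", ">@."));
// → foo<!-- >@. -->@<!-- >@. -->example<!-- >@. -->.<!-- >@. -->com
```

Note that visibleText is a single line of code, so a harvester that strips comments before searching is not slowed down by this method.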

Unicode characters, hexadecimal or decimal entities

Another way to protect web pages from email harvesting is to encode the email address into some language that the computer can understand but not without some additional work.
<a href="mailto:&#102;&#111;&#111;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#99;&#111;&#109;">&#102;&#111;&#111;&#64;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#46;&#99;&#111;&#109;</a>
The gibberish code provided above is the same foo@example.com email address as in the Plain HTML code section, just written in a different notation (decimal entities). Even though this code is not readable to a human, a browser or email client displays it exactly like the plain mailto:foo@example.com link. Here is a page that tells you how this can be done in PHP: PHP loop through string.
If you want to know how the gibberish code translates to readable letters, take a look at the ASCII table (dec 102 = char f). Our ASCII to hex converter and dec to hex converter tools can help you when setting this up.
Not a bad idea; however, from an email address harvesting robot's perspective this is similar to the methods above. A robot can interpret the character entities just as easily as a browser does. But not every email address harvesting robot is programmed to do this conversion. If you combine a mix of unicode characters, decimal and hexadecimal entities, you will be another step ahead.
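A mix of decimal and hexadecimal entities can be produced with a few lines of script. This is a minimal sketch that assumes a plain ASCII address; the function names are invented for this example, and the decoder is included only to check the result (and to show how little work decoding takes).

```javascript
// Hypothetical encoder: alternate decimal (&#102;) and hexadecimal (&#x6f;) entities.
function toMixedEntities(text) {
  let encoded = "";
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    encoded += i % 2 === 0
      ? "&#" + code + ";"
      : "&#x" + code.toString(16) + ";";
  }
  return encoded;
}

// Reverse operation, roughly what a browser (or a smarter harvester) does.
function decodeEntities(encoded) {
  return encoded.replace(/&#(x?)([0-9a-fA-F]+);/g, (match, hex, digits) =>
    String.fromCodePoint(parseInt(digits, hex ? 16 : 10))
  );
}

console.log(toMixedEntities("foo"));
// → &#102;&#x6f;&#111;
```

The encoded string can be pasted both into the href attribute (after mailto:) and into the link text.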

Email address or its parts displayed as images

Another way to protect web pages from email harvesting is to use a small image that contains either the full email address or its parts. Even though obtaining information from an image is possible, only a few email harvesting programs are capable of doing this. Obtaining your email address from an image is resource-costly and, for email address harvesters, not worth the effort.
The email address is shown as an image. Although this method is very effective, it has some major disadvantages too. Only user-agents that can render the image properly will display the email address. Visually impaired users may not be able to obtain the email address. And, if you have thousands of visitors per day, this can be a performance issue as well. You can mitigate some of these disadvantages by substituting only the AT and DOT with images.
foo [image: AT] example [image: DOT] com
This makes the address unreadable to email address harvesting robots but still semi-readable to visually impaired humans.

Other techniques

There are many more techniques that can be used to protect web pages from email harvesting. The following is a list of tips that can help you to prevent email harvesting.

Email address in HTTP redirect

One way to prevent email address harvesting is to write a server-side script to return the mailto:foo@example.com link as an HTTP redirect. All modern browsers recognize mailto in the response header but not every harvester is capable of understanding this. Here is an example showing how this can be done in PHP. You display your email as, for example:
<a href="email_address.php">This is my email address, click here.</a>
The content of the email_address.php file is the following:
<?php
header ("Location: mailto:foo@example.com");
exit();
?>
Note that the header() call has to run before the script sends any other output. When the visitor clicks the link in the A HREF, the browser requests the email_address.php file, which redirects it to mailto:foo@example.com.
If the visitor's computer is set up properly, its mail application should be able to capture the email address and populate the To: field with it.
The advantage of this approach is that the email address is not directly visible at the web page, but, theoretically, some harvesters might be able to get the email address from the response header.


Email address and mailto as JavaScript.

This is another common technique to prevent email address harvesting. Instead of using the plain <a href="mailto:address"></a> HTML tag, you would write out the same using JavaScript. There are numerous ways of doing this in JavaScript; however, the concept is the same. The idea is to break the email address into parts which cannot be easily parsed from the source code by the email address harvesting program. And the beauty of JavaScript is that this can be done in many ways. The easiest script would be:
<script type="text/javascript">
 document.write("foo@example.com")
</script>

This script displays the email address in the browser. The email address is not clickable, and the nice thing is that it does not include the mailto attribute which is what email address harvesting programs are looking for. The bad thing is that the email address is still visible and can be easily parsed if the spam bot is set up to look for the @ symbol. The next step in our security cook-book is to split the email address into pieces.
<script type="text/javascript">
 document.write("foo" + "&#x0040;" + "example" + "." + "com");
</script>
In this example, the email harvester would need to be smart enough to join the individual strings and also to translate the &#x0040; entity into the @ symbol. (How did we come up with &#x0040;? Take a look at the ASCII table and try our ASCII to hex converter.)
 If you are still worried that the email harvester might strip out the " + " and reassemble the email link, there is always something more you can do. Variables can have information added to their existing contents and that new content can even include the existing variable. If you wanted to make the code more complicated, you could use something like the following:
<script type="text/javascript">
var string1 = "foo";
var string2 = "@";
var string3 = "example";
var string4 = ".";
var string5 = "com";
var string6 = string1 + string2 + string3 + string4 + string5;
document.write('<a href="' + "mail" + "to:" + string1 + string2 + string3 + string4 + string5 + '">' + string6 + "</a>");
</script>

This looks pretty challenging for the email harvesting program, does it not? You can even combine this method with email faking and using images. Here is another example of what can be done in JavaScript.
 <a href='javascript:window.location="mail"+"to:"+"foo"+"@"+"example"+"."+"com";'
onmouseover='window.status="mail"+"to:"+"foo"+"@"+"example"+"."+"com"; return true;'
onmouseout='window.status="";return true;'>Click here to send mail.</a>

The drawback of the JavaScript method is that the email address is visible on screen only in browsers which support JavaScript. Browsers that have JavaScript turned off or do not support JavaScript will not display the email address at all.
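One more variation on the same idea, sketched here under the assumption of a plain ASCII address, is to keep the address in the page only as a list of character codes and turn it back into text at run time. The helper names are hypothetical.

```javascript
// Hypothetical helpers: the page source holds numbers instead of the address text.

// Run once, offline, to obtain the numbers to paste into your page.
function toCharCodes(address) {
  return Array.from(address, (ch) => ch.charCodeAt(0));
}

// Runs in the visitor's browser and rebuilds the clickable link.
function mailtoLinkFromCodes(codes, label) {
  const address = String.fromCharCode(...codes);
  return '<a href="' + "mail" + "to:" + address + '">' + label + "</a>";
}

console.log(toCharCodes("a@b.c"));
// → [ 97, 64, 98, 46, 99 ]

// In the page itself you would write something like:
// document.write(mailtoLinkFromCodes([97, 64, 98, 46, 99], "Send mail"));
```

As with the other JavaScript methods, visitors without JavaScript see nothing, so consider offering a contact form as a fallback.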

Email address via CSS2 pseudo-element :after

Here is another great technique that you can use to prevent email address harvesting. First, you would define your CSS code:
.emailDiv:after { content: "foo@example.com"; }
The class can be defined either in the HEAD of your page or in your *.css file. You can of course substitute the @ symbol with an entity. Once you have your CSS defined, you would display your email address as follows:
<div class="emailDiv">This is my email address: </div>
This code would be displayed in the browser as This is my email address: foo@example.com. The dark side of this technique is that only browsers that can interpret CSS2 will display the address. MSIE as of the end of 2008 does not display this; Firefox does.

Email address through CSS2 unicode-bidi (text direction)

Another technique to prevent email address harvesting is based on changing the direction of the text. The key in this email address harvesting prevention method is to change the direction of text from left-to-right (default) to right-to-left. First, you would define your CSS code:
div.codedirection { unicode-bidi: bidi-override; direction: rtl; }
and then you would display the email address on your page as
<div class="codedirection">moc.elpmaxe@oof</div>
The browser will display the email address as foo@example.com. The nice feature of this method is that the page source only ever contains the reversed string, while the CSS rule can live separately from your HTML code in your *.css file. Browsers without CSS2 support will display the email backwards, which could be quite bothersome for the visitor to invert.
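Reversing addresses by hand is error-prone, so the reversed markup is easier to generate. A minimal sketch follows; the function names are made up, and the codedirection class is the one from the CSS rule above.

```javascript
// Hypothetical helper: reverse the address so that direction: rtl shows it correctly.
function reverseForRtl(address) {
  return Array.from(address).reverse().join("");
}

// Wrap the reversed address in the div that carries the text-direction class.
function rtlMarkup(address) {
  return '<div class="codedirection">' + reverseForRtl(address) + "</div>";
}

console.log(rtlMarkup("foo@example.com"));
// → <div class="codedirection">moc.elpmaxe@oof</div>
```

You would run this once while building the page and paste the output into your HTML, so the readable address never has to appear in the source.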

Stuff email address with CSS display:none

Display none is another nice technique to prevent email address harvesting. In this case, we just interlace the email address with some text that we later remove from the body of the email with display none when rendering it in the browser. First, you would define your CSS:
.hideThisText { display:none; }
Now you would use this CSS class in your text.
foo@example<span class="hideThisText">[REMOVETHIS]</span>.com
The browser would display the email address as foo@example.com. Browsers that support the display:none property (most browsers do) will not display the [REMOVETHIS] to the user. Those browsers that do not support display:none will show the [REMOVETHIS] and the user will hopefully remove it before sending email to the address. The email is textually available to the user; however, the user cannot click a link in order to open their email client.

Use forms for emails

If you want to completely prevent email address harvesting, using forms is the best option. In this case, no email address is displayed at the website. The user has to contact you by filling out a form, and a server-side script forwards the data from the form to your email. Your email address is very safe. Spam robots simply pass this area by, as it contains no email address in the source code.
The disadvantage of this method is that your form can get spammed by content spammers, but that is another story. You would need to protect your form with a CAPTCHA.
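The server-side part can be sketched as follows. This is only an outline of the idea under several assumptions: the RECIPIENT constant, the field names (name, email, message) and the function name are invented for this example, and the actual sending of the mail is left to whatever mail facility your server provides.

```javascript
// Hypothetical core of a contact form handler.
// RECIPIENT lives only on the server and never appears in any page sent to visitors.
const RECIPIENT = "foo@example.com";

function buildContactMessage(form) {
  const name = String(form.name || "").trim();
  const replyTo = String(form.email || "").trim();
  const body = String(form.message || "").trim();

  // Line breaks in header fields could be abused to inject extra mail headers.
  if (/[\r\n]/.test(name) || /[\r\n]/.test(replyTo)) {
    throw new Error("Line breaks are not allowed in name or email");
  }
  // Loose sanity check only; it does not prove that the address exists.
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(replyTo)) {
    throw new Error("Please enter a valid email address");
  }
  if (body === "") {
    throw new Error("Please enter a message");
  }

  return {
    to: RECIPIENT,
    replyTo: replyTo,
    subject: "Contact form message from " + name,
    text: body,
  };
}
```

The returned object would then be handed to your mail-sending code. The line-break check matters because form fields that end up in mail headers are a common way for spammers to abuse contact forms.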

Which method is the best?

It depends. First of all, please note that there are many variations to the above methods, and they can be combined to produce a unique solution. Whatever you do at your website, please, keep in mind to fully test your code in all major browsers and their versions too.
Remember that some methods are limited in their accessibility. Web pages can be accessed not only through MSIE, Firefox, Opera, and Netscape on an x86 based PC, but also on Macs, through Linux and Unix, and you might have some visitors on cell phones and other hand-held devices too. Visually impaired people use screen readers.
When choosing the right method for your application, it does not necessarily need to be the one that is most complicated. Theoretically, the authors of email harvesters could write code that can break or decode every method listed here. If something can be engineered, it can always be reverse-engineered. However, consider the size of the source code that a harvester would need in order to account for every method listed here, and multiply that by the number of sites/pages a bot has to go through in order to collect a good number of emails. Accounting for every method listed here would call for extreme resources. So, most email address harvesting programs are kept very simple and primitive in order to be small and fast. With minimal measures, a large portion of harvesters can be fooled.